FAIRiCUBE / resource-metadata

Manage information for processing/analysis resources, specifically: an issue form to collect metadata requirements and an issue template to manage codelists.
https://fairicube.github.io/resource-metadata/

generated stac objects type #16

Closed baloola closed 7 months ago

baloola commented 1 year ago

The structure currently supported by the STAC API is a root catalog that contains collections, each of which contains its items (Catalog (root) -> Collections -> Items).

This means that if we have a resource-metadata collection, it cannot contain sub-collections in the STAC API.

I can see that the resource-metadata STAC objects are generated as collections ("ml_collection" and "no_ml_collection"), which means they cannot be ingested as sub-collections. I would propose generating a single resource-metadata collection and adding the remaining objects to it as items.
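
A minimal Python sketch of this proposal, assuming each generated resource is available as a plain dict with an `id` key. The collection id `resource-metadata`, the `resource-group` property, and the function name are hypothetical, chosen only for illustration; they are not part of the existing generator.

```python
from typing import Any

# Hypothetical id for the single collection proposed above.
COLLECTION_ID = "resource-metadata"


def flatten_to_items(groups: dict[str, list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Turn {former_collection_name: [resource, ...]} into a flat list of STAC Items."""
    items = []
    for group_name, resources in groups.items():
        for resource in resources:
            properties = {k: v for k, v in resource.items() if k != "id"}
            # Hypothetical property that keeps the former grouping searchable.
            properties["resource-group"] = group_name
            items.append({
                "type": "Feature",
                "stac_version": "1.0.0",
                "id": resource["id"],
                "collection": COLLECTION_ID,
                # The STAC Item spec allows a null geometry; a given API
                # deployment may still insist on a geometry and bbox.
                "geometry": None,
                "properties": properties,
                "links": [],
                "assets": {},
            })
    return items
```

Every item then points at the same collection, so the hierarchy stays at Catalog (root) -> Collection -> Items.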

KathiSchleidt commented 1 year ago

Do I understand correctly that while STAC can formally nest catalogs and collections in any way, the API only supports the simplified hierarchy of Catalog (root) -> Collections -> Items? No Collections of Collections?

As far as I remember, most of the datasets available in the existing FAIRiCUBE catalogue (taken from EDC) were already described as Collections. I wasn't quite sure why; my working assumption was that, since they are mostly time series, a single time slice was treated as an Item and the time series as a Collection. Does this mean we will not be able to create a group of dataset metadata separate from the a/p resource metadata?

baloola commented 1 year ago

Exactly, you are correct regarding the API structure. As for the EDC collections, I assume that ingesting each collection's items instead would be a valid workaround.

sMorrone commented 1 year ago

Dear @baloola, while I am not that happy with this structure, I am afraid we have no choice. Let's agree on the following: we have one FiC catalog with 3 collections: datasets, ML resources and non-ML resources. Perhaps you could just take the related JSON files (https://fairicube.github.io/resource-metadata/ml_collection.json and https://fairicube.github.io/resource-metadata/no_ml_collection.json) and include them as children of the root FiC catalog?

baloola commented 1 year ago

Hi @sMorrone, having 3 collections can be a workaround for now. I will try to adjust the ML resources and non-ML resources JSON files and create their collections (not sure if this will work). However, it would be better if both JSON files were created as root catalogs with the correct links and relations, e.g. ml_collection.json (note that the root and self links are the same):

{
    "type": "Catalog",
    "stac_version": "1.0.0",
    "stac_extensions": [],
    "title": "ML collection",
    "id": "ML collection",
    "description": "An example of ML collection.",
    "license": "various",
    "keywords": [
        "Machine Learning",
        "Deep Learning",
        "Artificial Intelligence"
    ],
    "links": [
        {
            "href": "https://fairicube.github.io/resource-metadata/ml_collection.json",
            "rel": "self",
            "type": "application/json"
        },
        {
            "href": "https://fairicube.github.io/resource-metadata/ml_collection.json",
            "rel": "root",
            "type": "application/json"
        },
        {
            "href": "https://fairicube.github.io/resource-metadata/Example 1 of D4.3 - Deep Learning.json",
            "rel": "item",
            "type": "application/geo+json"
        }
    ]
}
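
As a rough illustration of the shape above, the following Python sketch assembles such a root catalog, with the `self` and `root` links pointing at the same URL. The function name and its parameters are hypothetical and not taken from the resource-metadata generator.

```python
from typing import Any


def build_root_catalog(catalog_id: str, title: str, description: str,
                       self_href: str, item_hrefs: list[str]) -> dict[str, Any]:
    """Build a STAC Catalog dict that is its own root."""
    links = [
        # For a root catalog, "self" and "root" resolve to the same document.
        {"rel": "self", "href": self_href, "type": "application/json"},
        {"rel": "root", "href": self_href, "type": "application/json"},
    ]
    # One "item" link per item document, using the GeoJSON media type.
    links += [
        {"rel": "item", "href": href, "type": "application/geo+json"}
        for href in item_hrefs
    ]
    return {
        "type": "Catalog",
        "stac_version": "1.0.0",
        "id": catalog_id,
        "title": title,
        "description": description,
        "links": links,
    }
```

Passing absolute URLs for `self_href` and `item_hrefs` sidesteps the question of whether a consumer resolves relative links.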

I also noticed that the links in the generated STAC items are relative (e.g. "href": "./ml_collection.json"). I don't know whether the deployed stac-fastapi resolves such links. If it does, this is not a problem; otherwise absolute hrefs are needed.
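
If absolute hrefs turn out to be needed, one possible approach is to resolve every link against the URL the object is published at before ingesting it. This is only a sketch using the standard library; the function name is hypothetical, and the base URL in the usage note is an assumed example rather than a real file.

```python
import copy
from typing import Any
from urllib.parse import urljoin


def absolutize_links(stac_object: dict[str, Any], base_url: str) -> dict[str, Any]:
    """Return a copy of a STAC object whose link hrefs are absolute.

    base_url is the URL the object itself is served from; hrefs that are
    already absolute are left unchanged by urljoin.
    """
    resolved = copy.deepcopy(stac_object)
    for link in resolved.get("links", []):
        link["href"] = urljoin(base_url, link["href"])
    return resolved
```

For an item assumed to be served from `https://fairicube.github.io/resource-metadata/example_item.json`, the href `./ml_collection.json` resolves to `https://fairicube.github.io/resource-metadata/ml_collection.json`.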

cozzolinoac11 commented 1 year ago

Hi @baloola, we have seen that two collections ("ML collection" and "no-ML collection") have been created, but the items we had uploaded to the resource-metadata/stac folder are no longer visible in the browser. This probably happens because the collections in use differ from the ones in the GitHub folder. Can you please have a look and fix the issue?

baloola commented 1 year ago

Hi @cozzolinoac11, yes, I created the two collections manually. As for the items, I tried to ingest them, but they are invalid because they are missing some required fields (geometry and bbox). stac-fastapi validates items against the Item spec before ingesting them.

I have edited Example 3 of D4.3 - Pre-processing.json (I also added a datacube extension object to it), and you can see it now in the browser here.

The edited, ingested JSON:
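
A rough pre-check for this kind of problem can be sketched in Python as below. It only tests for the presence of the top-level fields the STAC Item spec requires and is not a substitute for the schema validation that stac-fastapi performs; the function name is hypothetical.

```python
from typing import Any

# Top-level fields the STAC Item spec requires on every item.
REQUIRED_FIELDS = ("type", "stac_version", "id", "geometry",
                   "properties", "links", "assets")


def missing_item_fields(item: dict[str, Any]) -> list[str]:
    """List required top-level fields that are absent from an item dict."""
    missing = [name for name in REQUIRED_FIELDS if name not in item]
    # bbox may only be omitted when geometry is explicitly null.
    geometry_is_null = "geometry" in item and item["geometry"] is None
    if "bbox" not in item and not geometry_is_null:
        missing.append("bbox")
    return missing
```

Running this over the generated files before ingestion would flag items that lack `geometry` and `bbox`.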

{
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "JPEG_to_numpy_transformation",
    "collection": "no-ML collection",
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [
                [
                    -180.0,
                    -90.0
                ],
                [
                    -180.0,
                    90.0
                ],
                [
                    180.0,
                    90.0
                ],
                [
                    180.0,
                    -90.0
                ],
                [
                    -180.0,
                    -90.0
                ]
            ]
        ]
    },
    "properties": {
        "cube:dimensions": {
            "x": {
                "axis": "x",
                "extent": [
                    -180,
                    180
                ],
                "reference_system": "EPSG:4326",
                "type": "spatial"
            },
            "y": {
                "axis": "y",
                "extent": [
                    -90,
                    90
                ],
                "reference_system": "EPSG:4326",
                "type": "spatial"
            },
            "time": {
                "values": [
                "2000-01-01T00:00:00"
                ],
                "type": "temporal"
            }
        },
        "title": "JPEG images to numpy array transformation",
        "description": "Building dataset as numpy array. In machine learning, Python uses image data in the format of a NumPy array, i.e., [Height, Width, Channel] format. Therefore, the images must be transformed in this format. In this case, the images are in JPEG format and, through pillow, NumPy and OpenCV functions, the transformation is performed. The cv2 package (OpenCV) has the method imread() which is used to load the image and it also reads the given image (PIL image) in the NumPy array format. Because the images within the dataset (i.e., the NumPy arrays) must all be the same size to be used, and as a matter of efficiency and calculation power, using cv2s resize() the images are resized from 350x350 pixels into 100x100 (this dimension can be easily changed). The channel is three because the images are RGB. This method then returns a dataset containing the images in the format of NumPy arrays and their respective class labels.",
        "main-category": "Pre-processing",
        "objective": "data-transformation",
        "datetime": "2023-08-05",
        "keywords": [
            "numpy array",
            "data transformation",
            "jpeg"
        ],
        "platform": "Google Colab",
        "framework": "OpenCV",
        "algorithm": "custom-method",
        "license": "CC-BY-4.0",
        "processor-used": "cpu",
        "operating-system-used": "linux",
        "use-constraints": "no Constraint of Use"
    },
    "links": [
        {
            "rel": "root",
            "href": "./index.json",
            "type": "application/json",
            "title": "Root Catalog"
        },
        {
            "rel": "parent",
            "href": "./no_ml_collection.json",
            "type": "application/json",
            "title": "no_ml_collection"
        },
        {
            "rel": "collection",
            "href": "./no_ml_collection.json",
            "type": "application/json",
            "title": "no_ml_collection"
        },
        {
            "href": "https://github.com/cozzolinoac11/wildfire_prediction/blob/main/img_to_NPY_transformation.ipynb",
            "rel": "about",
            "type": "text/html",
            "title": "Example-1"
        }
    ],
    "assets": {
        "input-data-used": {
            "href": "https://open.canada.ca/data/en/dataset/9d8f219c-4df0-4481-926f-8a2a532ca003",
            "type": "application/json",
            "title": "Input data used",
            "description": "Refer to Canadas website for the original wildfires data. The dataset is composed by satellite images (shape is 350x350).",
            "biases-and-ethical-aspects": "",
            "roles": [
                "data"
            ]
        },
        "output-data-obtained": {
            "href": "https://public.epsilon-italia.it/FAIRiCUBE/wildfire-classification/data_numpy.zip",
            "type": "application/json",
            "title": "Output data obtained",
            "description": "Dataset in format Numpy arrays. The images are resized in 100x100.",
            "roles": [
                "data"
            ]
        }
    },
    "bbox": [
        -180.0,
        -90.0,
        180.0,
        90.0
    ],
    "stac_extensions": [
        "https://stac-extensions.github.io/datacube/v2.0.0/schema.json"
    ]
}
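
The `geometry` and `bbox` added to the item above can also be generated rather than typed by hand. The Python sketch below derives the polygon from a bounding box and defaults to the whole-world extent used in the item; the helper names are hypothetical, and whether a global footprint is the right choice for resources without a spatial extent is an assumption.

```python
import copy
from typing import Any

# Whole-world extent as [west, south, east, north], as in the item above.
WORLD_BBOX = [-180.0, -90.0, 180.0, 90.0]


def bbox_to_polygon(bbox: list[float]) -> dict[str, Any]:
    """Build a closed GeoJSON Polygon ring from [west, south, east, north]."""
    west, south, east, north = bbox
    # Same vertex order as the item above; the first point is repeated to close the ring.
    ring = [[west, south], [west, north], [east, north], [east, south], [west, south]]
    return {"type": "Polygon", "coordinates": [ring]}


def with_footprint(item: dict[str, Any], bbox: list[float] = WORLD_BBOX) -> dict[str, Any]:
    """Return a copy of the item with bbox and a matching geometry set."""
    patched = copy.deepcopy(item)
    patched["bbox"] = list(bbox)
    patched["geometry"] = bbox_to_polygon(bbox)
    return patched
```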

baloola commented 1 year ago

Also, I had to correct the collection id.

cozzolinoac11 commented 1 year ago

Hi @baloola

baloola commented 1 year ago

yes, eventually that will be the procedure.

cozzolinoac11 commented 1 year ago

Hello @baloola, we have a problem with the browser link https://catalog.eoxhub.fairicube.eu/. Specifically, I can reach a resource by navigating the browser, but when I try to access it directly via its URL, I receive a 404 error. For example, the resource "JPEG images to numpy array transformation" under no-ML collection can be seen when navigating the browser, but https://catalog.eoxhub.fairicube.eu/collections/no-ML%20collection/items/JPEG_to_numpy_transformation returns a 404 error. This also happens with other resources. Can you help?
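
For reference, the direct URL in this report follows the `/collections/{collectionId}/items/{itemId}` pattern, with the space in the collection id percent-encoded. A small Python sketch of building such a URL is shown below; the function name is hypothetical, and the sketch says nothing about why the server returned 404.

```python
from urllib.parse import quote


def item_url(api_root: str, collection_id: str, item_id: str) -> str:
    """Build a direct item URL, percent-encoding both ids as path segments."""
    root = api_root.rstrip("/")
    # safe="" also encodes "/" so an id can never spill into another path segment.
    return (f"{root}/collections/{quote(collection_id, safe='')}"
            f"/items/{quote(item_id, safe='')}")
```

Calling `item_url("https://catalog.eoxhub.fairicube.eu/", "no-ML collection", "JPEG_to_numpy_transformation")` reproduces the URL quoted above.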

KathiSchleidt commented 1 year ago

@baloola as FAIRiCUBE Hub is now formally operational, I'd appreciate components such as the catalog to be available and functioning!!!

Btw, what's the status of the dataset metadata? From what I see, the catalog is still restricted to what's already available via EDC.

baloola commented 1 year ago

@cozzolinoac11 the issue was caused by the main stac-browser repo and is fixed now. Could you please try accessing the links again and confirm?

cozzolinoac11 commented 1 year ago

Hi @baloola Yes, it works correctly now. Can we then add the item stac-json files to the resource-metadata/stac folder so that we can see them automatically in the catalog?

baloola commented 1 year ago

I'm running a script in a GitHub Action to inject data-access items. Do you want me to do the same here? If so, maybe you can open an issue and assign it to me :)

baloola commented 7 months ago

resolved