fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Define ways to access BIA data in tests #565

Open tcompa opened 10 months ago

tcompa commented 10 months ago

As part of #557, I added a small test that fetches data from https://www.ebi.ac.uk/biostudies/bioimages/studies/S-BIAD843, runs import_ome_zarr, and checks some aspects of the new image ROI table -- see the test code below.

The test runs successfully (see https://github.com/fractal-analytics-platform/fractal-tasks-core/actions/runs/6494220845/job/17636746257?pr=557), but I will now mark it as "skip", so that it won't run in the CI (a possible opt-in alternative to an unconditional skip is sketched further below). This is because I don't know:

  1. whether FTP is the preferred access method (more info here: https://www.ebi.ac.uk/biostudies/help#download; see also the HTTPS-based sketch after the test code);
  2. whether there may be issues with download-rate limits.

I guess we should get in touch with the BIA team and ask whether it is OK that we frequently fetch some small data (I think this zip file was approximately 5-10 MB) in this way. If not, we can find other ways.
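For reference, a possible alternative to an unconditional skip would be to gate the test behind an environment variable, so that the CI never hits the BIA servers but the test can still be run on demand. This is only a sketch (the RUN_BIA_TESTS variable is made up for illustration), not what the PR currently does:

import os

import pytest


# Hypothetical opt-in gate: skipped unless RUN_BIA_TESTS=1, so the CI never
# downloads from BIA but the test remains runnable locally.
@pytest.mark.skipif(
    os.environ.get("RUN_BIA_TESTS") != "1",
    reason="BIA-download tests only run when RUN_BIA_TESTS=1",
)
def test_import_ome_zarr_image_BIA(tmp_path):
    ...  # test body as in the snippet below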


def test_import_ome_zarr_image_BIA(tmp_path):
    """
    This test imports one of the BIA OME-Zarr listed in
    https://www.ebi.ac.uk/biostudies/bioimages/studies/S-BIAD843.

    It is currently marked as "skip", to avoid running into download-rate
    limits.
    """

    from ftplib import FTP
    import zipfile

    import anndata as ad
    import numpy as np
    import zarr

    # Note: `import_ome_zarr`, `debug` and `_check_ROI_tables` are assumed to
    # be imported/defined at module level in the test file.

    # Download an existing OME-Zarr from BIA
    ftp = FTP("ftp.ebi.ac.uk")
    ftp.login()
    ftp.cwd("biostudies/fire/S-BIAD/843/S-BIAD843/Files")
    fname = "WD1_15-02_WT_confocalonly.ome.zarr.zip"
    with (tmp_path / fname).open("wb") as fp:
        ftp.retrbinary(f"RETR {fname}", fp.write)
    ftp.quit()

    with zipfile.ZipFile(tmp_path / fname, "r") as zip_ref:
        zip_ref.extractall(tmp_path)

    root_path = str(tmp_path)
    zarr_name = "WD1_15-02_WT_confocalonly.zarr/0"

    # Run import_ome_zarr
    metadiff = import_ome_zarr(
        input_paths=[str(root_path)],
        zarr_name=zarr_name,
        output_path="null",
        metadata={},
    )
    metadata = metadiff.copy()
    debug(metadata)

    # Check that tables were created
    _check_ROI_tables(f"{root_path}/{zarr_name}")

    # Check image_ROI_table
    g = zarr.open(f"{root_path}/{zarr_name}", mode="r")
    debug(g.attrs.asdict())
    dataset_0 = g.attrs["multiscales"][0]["datasets"][0]
    pixel_size_x = dataset_0["coordinateTransformations"][0]["scale"][-1]
    debug(pixel_size_x)
    g = zarr.open(f"{root_path}/{zarr_name}/0", mode="r")
    array_shape_x = g.shape[-1]
    debug(array_shape_x)
    EXPECTED_X_LENGTH = array_shape_x * pixel_size_x
    image_ROI_table = ad.read_zarr(
        f"{root_path}/{zarr_name}/tables/image_ROI_table"
    )
    debug(image_ROI_table.X)
    assert np.allclose(
        image_ROI_table[:, "len_x_micrometer"].X[0, 0],
        EXPECTED_X_LENGTH,
    )
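For comparison with the FTP call above, the same file should also be reachable over plain HTTPS, since the EBI FTP area is typically exposed at https://ftp.ebi.ac.uk/ as well (the help page linked above lists the available access methods). A minimal sketch, assuming that URL layout:

import urllib.request

# Same S-BIAD843 file as in the test above, fetched over HTTPS instead of FTP
# (assumes the FTP tree is mirrored under https://ftp.ebi.ac.uk/).
fname = "WD1_15-02_WT_confocalonly.ome.zarr.zip"
url = (
    "https://ftp.ebi.ac.uk/biostudies/fire/S-BIAD/843/S-BIAD843/Files/"
    + fname
)
urllib.request.urlretrieve(url, fname)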
tcompa commented 10 months ago

Recent logging of the time spent on Zenodo downloads (see discussion in #568) made it clear that these downloads take a significant amount of time in the GitHub CI (*): this can amount to something like 2 minutes out of the 6 minutes spent in pytest.

Let's keep this aspect in mind if/when we think about integrating other data sources in the CI.

I guess that we would get a significant speed-up with a GitHub repo like fractal-test-data, since all connections would then be from GitHub to GitHub (to be verified). This would be doable for our own data (the ones currently on Zenodo), but it's not obvious that we can do the same for BIA or other sources; a minimal sketch of the idea follows.
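As an illustration of that idea (the repo layout and file path below are hypothetical), fetching from such a repo would reduce to a single HTTPS request to raw.githubusercontent.com:

import urllib.request

# Hypothetical layout: a fractal-test-data repo holding zipped test OME-Zarrs;
# the org/repo/branch/path here are made up for illustration.
url = (
    "https://raw.githubusercontent.com/fractal-analytics-platform/"
    "fractal-test-data/main/ome-zarr/example.ome.zarr.zip"
)
urllib.request.urlretrieve(url, "example.ome.zarr.zip")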


(*) Note that this is not relevant for local tests, since Zenodo downloads only take place if the corresponding folders have been removed from their standard location in tests/data, which is typically not the case. The guard sketched below illustrates this behaviour.
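A sketch of that caching behaviour (the helper name is illustrative, not the actual one in the test suite): the download only runs when the target folder under tests/data is missing, so repeated local runs pay the cost at most once.

from pathlib import Path


def fetch_folder_if_missing(target: Path, download_fn) -> Path:
    # Illustrative guard: download only when the folder is absent, so local
    # test runs reuse data already present under tests/data.
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        download_fn(target)
    return target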