IDR / idr-utils

Utility scripts for managing IDR submissions
BSD 2-Clause "Simplified" License

Refactor parsing utility to generate JSON reports of the study files #48

Closed sbesson closed 8 months ago

sbesson commented 2 years ago

As part of the ongoing investigation around landing pages, this PR reviews the study_parser utility, which contains most of the logic to parse the study file and generate the OMERO annotations at the Project/Screen level. The main limitation at the moment is that the output of the --report option is closely tailored to the way the OMERO annotations are constructed.

To address this, this PR focuses on separating the file parsing logic from the formatting logic, which allows us to create different representations:
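
A minimal sketch of what separating parsing from formatting can look like. The function names and the simplified tab-separated input are assumptions made for illustration; they are not taken from pyidr/study_parser.py, and real study files contain sections and per-screen blocks that this sketch ignores.

```python
import json


def parse_study(text):
    """Parse 'key<TAB>value<TAB>value...' lines into a plain dict.

    Hypothetical simplification of a study file: every key maps to the
    list of its non-empty values.
    """
    study = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, *values = line.split("\t")
        study[key] = [v for v in values if v]
    return study


def to_json(study):
    """One formatter: serialise the parsed dict as a JSON report."""
    return json.dumps(study, indent=4, sort_keys=True)


def to_key_value_pairs(study):
    """Another formatter: flatten the dict into (key, value) pairs, the
    shape a map-annotation style output could use."""
    return [(key, value) for key, values in study.items() for value in values]


example = "Study Title\tHuman U2OS cells\nStudy PubMed ID\t25024206\t28327978\n"
parsed = parse_study(example)
print(to_json(parsed))
print(to_key_value_pairs(parsed))
```

Because both formatters consume the same parsed dictionary, a new representation only needs a new formatter and leaves the parsing code untouched.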

Potential next steps:

sbesson commented 2 years ago

To ensure this does not cause regression in the submission workflow, ed34234 was executed in check-mode against a copy of the current production DB with

find /uod/idr/metadata/ -maxdepth 2 -iname *study.txt -exec sh -c "echo {} && /opt/omero/server/venv3/bin/python pyidr/study_parser.py {} --check -q" \;

A few studies reported mismatching annotations (idr0114, idr0068, idr0043, idr0005). On closer examination, the mismatches are expected: the latest changes (e.g. new DOIs, new annotation files or a new GitHub repository) have not been propagated to the production DB. /cc @dominikl @francesw

sbesson commented 2 years ago

Separately, studies.zip contains the JSON generated with ed34234 by python pyidr/study_parser.py --report for a few studies of various types.

sbesson commented 2 years ago

With the last commit (API still needs to be improved), studies.zip contains a new version of JSON generated for most of the published studies with all the multi-value keys now split as lists e.g.

(base) sbesson@ls30630:idr-utils (study_json) $ cat studies/idr0016.json 
{
    "Experiments": [],
    "Screens": [
        {
            "Screen Description": "Using the cell-painting assay developed by Gustafsdottir et al, 2013, the Broad Institute has assembled a reference dataset of profiles for U2OS osteosarcoma cells treated with ~30,000 compounds. The experiment consisted of 413 microtiter plates. Each plate has 384 wells. Each well has 6 fields of view; a very small fraction of the wells (0.002%) have a few sites missing. Each field was imaged in five channels (detection wavelengths), and each channel is stored as a separate, grayscale 16-bit TIFF image file.",
            "Screen Imaging Method": [
                "fluorescence microscopy"
            ],
            "Screen Name": "idr0016-wawer-bioactivecompoundprofiling/screenA",
            "Screen Number": "1",
            "Screen Sample Type": [
                "cell"
            ],
            "Screen Technology Type": [
                "compound screen"
            ],
            "Screen Type": [
                "primary screen"
            ]
        }
    ],
    "Study Accession": "idr0016",
    "Study Author List": [
        "Wawer MJ, Li K, Gustafsdottir SM, Ljosa V, Bodycombe NE, Marton MA, Sokolnicki KL, Bray MA, Kemp MM, Winchester E, Taylor B, Grant GB, Hon CS, Duvall JR, Wilson JA, Bittker JA, Dan\u010d\u00edk V, Narayan R, Subramanian A, Winckler W, Golub TR, Carpenter AE, Shamji AF, Schreiber SL, Clemons PA",
        "Bray MA, Gustafsdottir SM, Rohban MH, Singh S, Ljosa V, Sokolnicki KL, Bittker JA, Bodycombe NE, Danc\u00edk V, Hasaka TP, Hon CS, Kemp MM, Li K, Walpita D, Wawer MJ, Golub TR, Schreiber SL, Clemons PA, Shamji AF, Carpenter AE"
    ],
    "Study Copyright": "Wawer et al",
    "Study DOI": [
        "http://dx.doi.org/10.1073/pnas.1410933111",
        "http://dx.doi.org/10.1093/gigascience/giw014"
    ],
    "Study Description": "Phenotypic profiling attempts to summarize multiparametric, feature-based analysis of cellular phenotypes of each sample so that similarities between profiles reflect similarities between samples. This image set provides a basis for testing image-based profiling methods wrt. to their ability to distinguish the effects of small molecules. The images are of U2OS cells treated with each of over 30,000 known bioactive compounds and labeled with six labels that characterize seven organelles (the cell-painting assay). Gustafsdottir et al. (doi:10.1371/journal.pone.0080999) have developed a multiplex cytological profiling assay that \"paints the cell\" with as many fluorescent markers as possible without compromising our ability to extract rich, quantitative profiles in high throughput. The assay detects seven major cellular components. In a pilot screen of bioactive compounds, the assay detected a range of cellular phenotypes and it clustered compounds with similar annotated protein targets or chemical structure based on cytological profiles. The results demonstrate that the assay captures subtle patterns in the combination of morphological labels, thereby detecting the effects of chemical compounds even though their targets are not stained directly. This image-based assay provides an unbiased approach to characterize compound- and disease-associated cell states to support future probe discovery. Using the cell-painting assay, the Broad Institute has assembled a reference dataset of profiles for U2OS osteosarcoma cells treated with ~30,000 compounds. The compound collection includes DOS-derived compounds (20,000), as well as chemically diverse MLI compounds with biologically diverse performance identified through analysis of PubChem (10,000), and known bioactive compounds to serve as landmarks (2,500). The DOS compounds consist of structurally diverse and stereochemically rich compounds with structures distinct from the current MLSMR. 
The compound collection also includes 267 distinct compounds nominated by MLPCN Centers from projects for which the Centers would like to identify new chemical series with similar activities.",
    "Study External URL": [
        "http://www.cellimagelibrary.org/pages/project_20269",
        "http://gigadb.org/dataset/100200"
    ],
    "Study License": "CC0 1.0",
    "Study License URL": "https://creativecommons.org/publicdomain/zero/1.0/",
    "Study Name": "idr0016-wawer-bioactivecompoundprofiling",
    "Study Organism": [
        "Homo sapiens"
    ],
    "Study Organism Term Accession": [
        "NCBITaxon_9606"
    ],
    "Study Organism Term Source REF": [
        "NCBITaxon"
    ],
    "Study PMC ID": [
        "PMC4121832",
        "PMC5721342"
    ],
    "Study Person Address": [
        "Broad Institute of Harvard and MIT, Cambridge, MA, USA",
        "Broad Institute of Harvard and MIT, Cambridge, MA, USA"
    ],
    "Study Person Email": [
        "anne@broadinstitute.org",
        "pclemons@broadinstitute.org."
    ],
    "Study Person First Name": [
        "Anne",
        "Paul"
    ],
    "Study Person Last Name": [
        "Carpenter",
        "Clemons"
    ],
    "Study Person Roles": [
        "submitter",
        "submitter"
    ],
    "Study PubMed ID": [
        "25024206",
        "28327978"
    ],
    "Study Public Release Date": "2016-12-16",
    "Study Publication Title": [
        "Toward performance-diverse small-molecule libraries for cell-based phenotypic screening using multiplexed high-dimensional profiling.",
        "A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay."
    ],
    "Study Screens Number": "1",
    "Study Title": "Human U2OS cells - compound cell-painting experiment",
    "Study Type": [
        "high content screen"
    ],
    "Study Type Term Accession": [
        "EFO_0007550"
    ],
    "Study Type Term Source REF": [
        "EFO"
    ],
    "Study Version History": "In August 2017 the 20 plates from the Broad Bioimaging Benchmark Collection dataset BBBC022 were moved to a separate screen, idr0036/screenA. At the same time 58 new plates, which had become available from the Cell Image Library (http://www.cellimagelibrary.org/pages/project_20269) after the original import to idr0016, were added to idr0016 and annotation was added to plate 25723 (https://idr.openmicroscopy.org/webclient/?show=plate-5104) which was previously unannotated. Protocols have been updated to reflect the materials and methods information in Wawer et al 2014.",
    "Term Source Name": [
        "NCBITaxon",
        "EFO",
        "CMPO",
        "FBbi"
    ],
    "Term Source URI": [
        "http://purl.obolibrary.org/obo/",
        "http://www.ebi.ac.uk/efo/",
        "http://www.ebi.ac.uk/cmpo/",
        "http://purl.obolibrary.org/obo/"
    ]
}
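
In the report above, some keys hold a list even for a single value (e.g. "Study Organism") while others stay plain strings (e.g. "Study Copyright"). One way to get that shape is sketched below; the set of multi-value keys and the helper name are hypothetical, and the real parser may decide this differently.

```python
# Hypothetical set of keys treated as multi-valued (illustration only).
MULTI_VALUE_KEYS = {"Study DOI", "Study PubMed ID", "Study Organism"}


def normalise(key, raw_values):
    """Return a list for multi-value keys and a single string otherwise."""
    values = [v.strip() for v in raw_values if v.strip()]
    if key in MULTI_VALUE_KEYS:
        return values
    return values[0] if values else ""


print(normalise("Study DOI", [
    "http://dx.doi.org/10.1073/pnas.1410933111",
    "http://dx.doi.org/10.1093/gigascience/giw014",
]))
print(normalise("Study Organism", ["Homo sapiens"]))
print(normalise("Study Copyright", ["Wawer et al"]))
```

Deciding per key rather than per value count keeps the type of each field stable across studies, so a consumer never has to handle both a string and a list for the same key.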

sbesson commented 2 years ago

With the last commit, running the parser + OMERO formatter in check-mode on test104

[sbesson@test104-omeroreadwrite idr-utils]$  find /uod/idr/metadata/ -maxdepth 2 -iname *study.txt -exec sh -c "echo {} && /opt/omero/server/venv3/bin/python pyidr/study_parser.py {} --check -q" \;
/uod/idr/metadata/idr0001-graml-sysgro/idr0001-study.txt
/uod/idr/metadata/idr0002-heriche-condensation/idr0002-study.txt
/uod/idr/metadata/idr0003-breker-plasticity/idr0003-study.txt
/uod/idr/metadata/idr0004-thorpe-rad52/idr0004-study.txt
/uod/idr/metadata/idr0006-fong-nuclearbodies/idr0006-study.txt
/uod/idr/metadata/idr0007-srikumar-sumo/idr0007-study.txt
/uod/idr/metadata/idr0008-rohn-actinome/idr0008-study.txt
/uod/idr/metadata/idr0009-simpson-secretion/idr0009-study.txt
/uod/idr/metadata/idr0010-doil-dnadamage/idr0010-study.txt
/uod/idr/metadata/idr0011-ledesmafernandez-dad4/idr0011-study.txt
/uod/idr/metadata/idr0012-fuchs-cellmorph/idr0012-study.txt
/uod/idr/metadata/idr0013-neumann-mitocheck/idr0013-study.txt
/uod/idr/metadata/idr0015-colin-taraoceans/idr0015-study.txt
ERROR:pyidr.study_parser.Formatter:Mismatching annotation
/uod/idr/metadata/idr0016-wawer-bioactivecompoundprofiling/idr0016-study.txt
/uod/idr/metadata/idr0017-breinig-drugscreen/idr0017-study.txt
/uod/idr/metadata/idr0018-neff-histopathology/idr0018-study.txt
/uod/idr/metadata/idr0019-sero-nfkappab/idr0019-study.txt
ERROR:pyidr.study_parser.Formatter:Mismatching annotation
/uod/idr/metadata/idr0020-barr-chtog/idr0020-study.txt
/uod/idr/metadata/idr0021-lawo-pericentriolarmaterial/idr0021-study.txt
/uod/idr/metadata/idr0022-koedoot-cellmigration/idr0022-study.txt
/uod/idr/metadata/idr0023-szymborska-nuclearpore/idr0023-study.txt
/uod/idr/metadata/idr0025-stadler-proteinatlas/idr0025-study.txt
/uod/idr/metadata/idr0026-weigelin-immunotherapy/idr0026-study.txt
/uod/idr/metadata/idr0027-dickerson-chromatin/idr0027-study.txt
/uod/idr/metadata/idr0028-pascualvargas-rhogtpases/idr0028-study.txt
/uod/idr/metadata/idr0030-sero-yap/idr0030-study.txt
/uod/idr/metadata/idr0032-yang-meristem/idr0032-study.txt
/uod/idr/metadata/idr0033-rohban-pathways/idr0033-study.txt
/uod/idr/metadata/idr0034-kilpinen-hipsci/idr0034-study.txt
/uod/idr/metadata/idr0035-caie-drugresponse/idr0035-study.txt
/uod/idr/metadata/idr0036-gustafsdottir-cellpainting/idr0036-study.txt
/uod/idr/metadata/idr0037-vigilante-hipsci/idr0037-study.txt
/uod/idr/metadata/idr0038-held-kidneylightsheet/idr0038-study.txt
/uod/idr/metadata/idr0040-aymoz-singlecell/idr0040-study.txt
/uod/idr/metadata/idr0041-cai-mitoticatlas/idr0041-study.txt
/uod/idr/metadata/idr0042-nirschl-wsideeplearning/idr0042-study.txt
/uod/idr/metadata/idr0043-uhlen-humanproteinatlas/idr0043-study.txt
/uod/idr/metadata/idr0044-mcdole-tardislightsheet/idr0044-study.txt
/uod/idr/metadata/idr0045-reichmann-zygotespindle/idr0045-study.txt
/uod/idr/metadata/idr0047-neuert-yeastmrna/idr0047-study.txt
/uod/idr/metadata/idr0048-abdeladim-chroms/idr0048-study.txt
/uod/idr/metadata/idr0050-springer-cytoskeletalsystems/idr0050-study.txt
/uod/idr/metadata/idr0051-fulton-tailbudlightsheet/idr0051-study.txt
/uod/idr/metadata/idr0052-walther-condensinmap/idr0052-study.txt
/uod/idr/metadata/idr0053-faas-virtualnanoscopy/idr0053-study.txt
/uod/idr/metadata/idr0054-segura-tonsilhyperion/idr0054-study.txt
/uod/idr/metadata/idr0056-stojic-lncrnas/idr0056-study.txt
/uod/idr/metadata/idr0061-wolf-spindlepositioning/idr0061-study.txt
/uod/idr/metadata/idr0062-blin-nuclearsegmentation/idr0062-study.txt
/uod/idr/metadata/idr0063-newman-chromosomalloci/idr0063-study.txt
/uod/idr/metadata/idr0064-goglia-erkdynamics/idr0064-study.txt
/uod/idr/metadata/idr0065-camsund-crispri/idr0065-study.txt
/uod/idr/metadata/idr0066-voigt-mesospim/idr0066-study.txt
/uod/idr/metadata/idr0067-king-yeastmeiosis/idr0067-study.txt
/uod/idr/metadata/idr0069-caldera-perturbome/idr0069-study.txt
/uod/idr/metadata/idr0070-kerwin-hdbr/idr0070-study.txt
/uod/idr/metadata/idr0071-feldman-crisprko/idr0071-study.txt
/uod/idr/metadata/idr0072-schormann-subcellref/idr0072-study.txt
/uod/idr/metadata/idr0073-schaadt-immuneinfiltrates/idr0073-study.txt
/uod/idr/metadata/idr0075-cabirol-honeybee/idr0075-study.txt
/uod/idr/metadata/idr0076-ali-metabric/idr0076-study.txt
/uod/idr/metadata/idr0077-valuchova-flowerlightsheet/idr0077-study.txt
/uod/idr/metadata/idr0078-mattiazziusaj-endocyticcomp/idr0078-study.txt
/uod/idr/metadata/idr0079-hartmann-lateralline/idr0079-study.txt
/uod/idr/metadata/idr0080-way-perturbation/idr0080-study.txt
/uod/idr/metadata/idr0081-georgi-adenovirus/idr0081-study.txt
/uod/idr/metadata/idr0082-pennycuick-lesions/idr0082-study.txt
/uod/idr/metadata/idr0083-lamers-sarscov2/idr0083-study.txt
/uod/idr/metadata/idr0084-oudelaar-alphaglobin/idr0084-study.txt
/uod/idr/metadata/idr0085-walsh-mfhrem/idr0085-study.txt
/uod/idr/metadata/idr0086-miron-micrographs/idr0086-study.txt
/uod/idr/metadata/idr0087-paci-nuclearimport/idr0087-study.txt
/uod/idr/metadata/idr0088-cox-phenomicprofiling/idr0088-study.txt
/uod/idr/metadata/idr0089-fischl-coldtemp/idr0089-study.txt
/uod/idr/metadata/idr0090-ashdown-malaria/idr0090-study.txt
/uod/idr/metadata/idr0091-julou-lacinduction/idr0091-study.txt
/uod/idr/metadata/idr0092-ostrop-organoid/idr0092-study.txt
/uod/idr/metadata/idr0093-mueller-perturbation/idr0093-study.txt
/uod/idr/metadata/idr0094-ellinger-sarscov2/idr0094-study.txt
/uod/idr/metadata/idr0095-ali-asymmetry/idr0095-study.txt
/uod/idr/metadata/idr0097-reicher-proteintag/idr0097-study.txt
/uod/idr/metadata/idr0098-huang-octmos/idr0098-study.txt
/uod/idr/metadata/idr0099-jain-beetlelightsheet/idr0099-study.txt
/uod/idr/metadata/idr0100-capar-myelin/idr0100-study.txt
/uod/idr/metadata/idr0103-coomer-hiv1fusion/idr0103-study.txt
/uod/idr/metadata/idr0106-kubota-lunglightsheet/idr0106-study.txt
/uod/idr/metadata/idr0109-zaritsky-melanoma/idr0109-study.txt
/uod/idr/metadata/idr0114-lindsay-hdbr/idr0114-study.txt
/uod/idr/metadata/idr0107-morgan-hei10/idr0107-study.txt
/uod/idr/metadata/idr0111-lee-cellmigration/idr0111-study.txt
/uod/idr/metadata/idr0110-rodermund-xistrna/idr0110-study.txt
/uod/idr/metadata/idr0113-bottes-opcclones/idr0113-study.txt
/uod/idr/metadata/idr0108-sabinina-nuclearporecomplex/idr0108-study.txt
/uod/idr/metadata/idr0117-croce-marimba/idr0117-study.txt
/uod/idr/metadata/idr0112-verzat-motorneurons/idr0112-study.txt
/uod/idr/metadata/idr0101-payne-insitugenomeseq/idr0101-study.txt
/uod/idr/metadata/idr0116-deboer-npod/idr0116-study.txt
/uod/idr/metadata/idr0005-toret-adhesion/idr0005-study.txt
/uod/idr/metadata/idr0118-keenan-flylightsheet/idr0118-study.txt
/uod/idr/metadata/idr0096-tratwal-marrowquant/idr0096-study.txt
/uod/idr/metadata/idr0068-shah-zebrafishlightsheet/idr0068-study.txt
/uod/idr/metadata/idr0124-esteban-heartmorphogenesis/idr0124-study.txt

The two mismatching annotations relate to undefined Screen Type values in idr0015 and idr0019, which are currently generated as keys with empty values in the map annotation.
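
To illustrate how an empty-valued key can trigger such a mismatch, here is a small sketch that compares two lists of key-value pairs. It assumes the existing annotation carries a "Screen Type" key with an empty value that the refactored output no longer emits; the function name and the ignore_empty switch are hypothetical and not part of study_parser.py.

```python
def find_mismatches(expected, existing, ignore_empty=False):
    """Compare two lists of (key, value) pairs.

    Returns (missing, extra): pairs only in `expected` and pairs only in
    `existing`. With ignore_empty=True, pairs with an empty value are
    dropped from both sides before comparing.
    """
    def clean(pairs):
        return {(k, v) for k, v in pairs if v or not ignore_empty}

    exp, cur = clean(expected), clean(existing)
    return sorted(exp - cur), sorted(cur - exp)


existing = [("Study Type", "high content screen"), ("Screen Type", "")]
expected = [("Study Type", "high content screen")]

print(find_mismatches(expected, existing))
print(find_mismatches(expected, existing, ignore_empty=True))
```

The first call reports the empty "Screen Type" pair as extra, while the second call treats the two annotations as equivalent.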

The workflow artifacts in https://github.com/IDR/idr-metadata/actions/runs/1745648670 contain the JSON representation of all study files.

As a lot of refactoring has already happened and the annotation is largely unmodified, barring the Screen Type element discussed above, I propose evaluating these changes as they stand. Follow-up PRs can then focus on refining the JSON formatting of the study files.

will-moore commented 2 years ago

Looks good. It would be nice if "Study Author List" was actually a list with 1 author per item. All studies have just a single multi-author item in the list, except idr0016 which has 2 items in the list.

Then we add these as JSON attachments to each study (one or all of the Projects/Screens named idr0NNN)? And then what's the API used to retrieve them? An IDR-specific URL (in what web app?) e.g. /gallery/idr/study/0123/json/ or something generic like /webclient/file_annotation/?name=idr0123.json ?

sbesson commented 2 years ago

> All studies have just a single multi-author item in the list, except idr0016 which has 2 items in the list.

idr0035 has 5 associated publications, and hence 5 separate author lists.
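
A possible way to get one author per item while still keeping one author list per publication is sketched below. The helper name is hypothetical, and it assumes authors within a publication are separated by commas, as in the idr0016 report.

```python
def split_authors(author_lists):
    """Split each comma-separated author string into individual names,
    keeping one inner list per publication."""
    return [
        [name.strip() for name in authors.split(",") if name.strip()]
        for authors in author_lists
    ]


authors = split_authors([
    "Wawer MJ, Li K, Gustafsdottir SM",
    "Bray MA, Gustafsdottir SM",
])
print(authors)
```

Nesting the lists preserves which authors belong to which publication, which a flat list of names would lose.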

Thinking about how to turn this into Linked Data, the Author concept might be a good one as it can be mapped to existing types like schema.org/Person. We might, however, need to differentiate two author concepts:

In terms of metadata, the latter one offers more advantages and contains more information.
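
As a rough sketch of the schema.org/Person mapping, the parallel "Study Person ..." lists from the JSON report could be zipped into JSON-LD style dictionaries. This assumes the richer concept corresponds to those "Study Person" fields; the function name is hypothetical, and only the standard schema.org properties givenName, familyName and email are used.

```python
import json


def study_persons(study):
    """Zip the parallel 'Study Person ...' lists into schema.org Person
    dictionaries (JSON-LD style)."""
    first = study.get("Study Person First Name", [])
    last = study.get("Study Person Last Name", [])
    email = study.get("Study Person Email", [])
    return [
        {
            "@context": "https://schema.org",
            "@type": "Person",
            "givenName": given,
            "familyName": family,
            "email": mail,
        }
        for given, family, mail in zip(first, last, email)
    ]


study = {
    "Study Person First Name": ["Anne"],
    "Study Person Last Name": ["Carpenter"],
    "Study Person Email": ["anne@broadinstitute.org"],
}
persons = study_persons(study)
print(json.dumps(persons, indent=4))
```

Note that zip stops at the shortest list, so a study with a missing email would silently drop a person; a production version would need to handle uneven lists explicitly.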

> Then we add these as JSON attachments to each study (one or all of the Projects/Screens named idr0NNN)?

Attaching would be easy but as you mention we might want to define how people should consume these. I was initially trying to build a JSON representation that would contain all necessary information to generate study landing pages.

sbesson commented 2 years ago

Coming back to it (arguably late), I am in two minds about this feature. Looking at the files diff, it is substantial, and although the description tries to capture the set of changes, I would need to review and retest it more closely before I would be happy for this script to be used in production.

As indicated in the PR body, the context of this investigation was the generation of study landing pages. Since then, we decided to focus our efforts on the development of the search engine/UI. So there are obviously some concerns about pushing a substantial rewrite without a clear consumer or more people looking at it.

If we agree it is safer not to merge this in, options are either to keep this open until we come back to it or close and convert into an issue.