Closed sbesson closed 8 months ago
To ensure this does not cause regression in the submission workflow, ed34234 was executed in check-mode against a copy of the current production DB with
find /uod/idr/metadata/ -maxdepth 2 -iname *study.txt -exec sh -c "echo {} && /opt/omero/server/venv3/bin/python pyidr/study_parser.py {} --check -q" \;
A few studies reported mismatching annotations (idr0114, idr0068, idr0043, idr0005). After closer examination, the mismatch is expected as the latest changes (e.g. new DOIs, new annotation files or a new GitHub repository) have not been propagated to the production DB. /cc @dominikl @francesw
Otherwise, studies.zip contains the JSON generated by python pyidr/study_parser.py --report for a few studies of various types with ed34234
With the last commit (API still needs to be improved), studies.zip contains a new version of JSON generated for most of the published studies with all the multi-value keys now split as lists e.g.
(base) sbesson@ls30630:idr-utils (study_json) $ cat studies/idr0016.json
{
"Experiments": [],
"Screens": [
{
"Screen Description": "Using the cell-painting assay developed by Gustafsdottir et al, 2013, the Broad Institute has assembled a reference dataset of profiles for U2OS osteosarcoma cells treated with ~30,000 compounds. The experiment consisted of 413 microtiter plates. Each plate has 384 wells. Each well has 6 fields of view; a very small fraction of the wells (0.002%) have a few sites missing. Each field was imaged in five channels (detection wavelengths), and each channel is stored as a separate, grayscale 16-bit TIFF image file.",
"Screen Imaging Method": [
"fluorescence microscopy"
],
"Screen Name": "idr0016-wawer-bioactivecompoundprofiling/screenA",
"Screen Number": "1",
"Screen Sample Type": [
"cell"
],
"Screen Technology Type": [
"compound screen"
],
"Screen Type": [
"primary screen"
]
}
],
"Study Accession": "idr0016",
"Study Author List": [
"Wawer MJ, Li K, Gustafsdottir SM, Ljosa V, Bodycombe NE, Marton MA, Sokolnicki KL, Bray MA, Kemp MM, Winchester E, Taylor B, Grant GB, Hon CS, Duvall JR, Wilson JA, Bittker JA, Dan\u010d\u00edk V, Narayan R, Subramanian A, Winckler W, Golub TR, Carpenter AE, Shamji AF, Schreiber SL, Clemons PA",
"Bray MA, Gustafsdottir SM, Rohban MH, Singh S, Ljosa V, Sokolnicki KL, Bittker JA, Bodycombe NE, Danc\u00edk V, Hasaka TP, Hon CS, Kemp MM, Li K, Walpita D, Wawer MJ, Golub TR, Schreiber SL, Clemons PA, Shamji AF, Carpenter AE"
],
"Study Copyright": "Wawer et al",
"Study DOI": [
"http://dx.doi.org/10.1073/pnas.1410933111",
"http://dx.doi.org/10.1093/gigascience/giw014"
],
"Study Description": "Phenotypic profiling attempts to summarize multiparametric, feature-based analysis of cellular phenotypes of each sample so that similarities between profiles reflect similarities between samples. This image set provides a basis for testing image-based profiling methods wrt. to their ability to distinguish the effects of small molecules. The images are of U2OS cells treated with each of over 30,000 known bioactive compounds and labeled with six labels that characterize seven organelles (the cell-painting assay). Gustafsdottir et al. (doi:10.1371/journal.pone.0080999) have developed a multiplex cytological profiling assay that \"paints the cell\" with as many fluorescent markers as possible without compromising our ability to extract rich, quantitative profiles in high throughput. The assay detects seven major cellular components. In a pilot screen of bioactive compounds, the assay detected a range of cellular phenotypes and it clustered compounds with similar annotated protein targets or chemical structure based on cytological profiles. The results demonstrate that the assay captures subtle patterns in the combination of morphological labels, thereby detecting the effects of chemical compounds even though their targets are not stained directly. This image-based assay provides an unbiased approach to characterize compound- and disease-associated cell states to support future probe discovery. Using the cell-painting assay, the Broad Institute has assembled a reference dataset of profiles for U2OS osteosarcoma cells treated with ~30,000 compounds. The compound collection includes DOS-derived compounds (20,000), as well as chemically diverse MLI compounds with biologically diverse performance identified through analysis of PubChem (10,000), and known bioactive compounds to serve as landmarks (2,500). The DOS compounds consist of structurally diverse and stereochemically rich compounds with structures distinct from the current MLSMR. 
The compound collection also includes 267 distinct compounds nominated by MLPCN Centers from projects for which the Centers would like to identify new chemical series with similar activities.",
"Study External URL": [
"http://www.cellimagelibrary.org/pages/project_20269",
"http://gigadb.org/dataset/100200"
],
"Study License": "CC0 1.0",
"Study License URL": "https://creativecommons.org/publicdomain/zero/1.0/",
"Study Name": "idr0016-wawer-bioactivecompoundprofiling",
"Study Organism": [
"Homo sapiens"
],
"Study Organism Term Accession": [
"NCBITaxon_9606"
],
"Study Organism Term Source REF": [
"NCBITaxon"
],
"Study PMC ID": [
"PMC4121832",
"PMC5721342"
],
"Study Person Address": [
"Broad Institute of Harvard and MIT, Cambridge, MA, USA",
"Broad Institute of Harvard and MIT, Cambridge, MA, USA"
],
"Study Person Email": [
"anne@broadinstitute.org",
"pclemons@broadinstitute.org."
],
"Study Person First Name": [
"Anne",
"Paul"
],
"Study Person Last Name": [
"Carpenter",
"Clemons"
],
"Study Person Roles": [
"submitter",
"submitter"
],
"Study PubMed ID": [
"25024206",
"28327978"
],
"Study Public Release Date": "2016-12-16",
"Study Publication Title": [
"Toward performance-diverse small-molecule libraries for cell-based phenotypic screening using multiplexed high-dimensional profiling.",
"A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay."
],
"Study Screens Number": "1",
"Study Title": "Human U2OS cells - compound cell-painting experiment",
"Study Type": [
"high content screen"
],
"Study Type Term Accession": [
"EFO_0007550"
],
"Study Type Term Source REF": [
"EFO"
],
"Study Version History": "In August 2017 the 20 plates from the Broad Bioimaging Benchmark Collection dataset BBBC022 were moved to a separate screen, idr0036/screenA. At the same time 58 new plates, which had become available from the Cell Image Library (http://www.cellimagelibrary.org/pages/project_20269) after the original import to idr0016, were added to idr0016 and annotation was added to plate 25723 (https://idr.openmicroscopy.org/webclient/?show=plate-5104) which was previously unannotated. Protocols have been updated to reflect the materials and methods information in Wawer et al 2014.",
"Term Source Name": [
"NCBITaxon",
"EFO",
"CMPO",
"FBbi"
],
"Term Source URI": [
"http://purl.obolibrary.org/obo/",
"http://www.ebi.ac.uk/efo/",
"http://www.ebi.ac.uk/cmpo/",
"http://purl.obolibrary.org/obo/"
]
}
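The splitting of multi-value keys shown in the JSON above can be illustrated with a minimal sketch. It assumes that each line of a study file is a tab-separated key followed by one or more values; `MULTI_VALUE_KEYS` and `parse_line` are hypothetical names used for illustration only and are not the pyidr API.

```python
# Hypothetical subset of keys treated as multi-valued (an assumption for
# illustration; the real list lives in the parser and may differ).
MULTI_VALUE_KEYS = {"Study DOI", "Study PubMed ID", "Study Organism"}


def parse_line(line):
    """Split one tab-separated study file line into a (key, value) pair.

    Multi-value keys always yield a list, even for a single value.
    Other keys yield the first value as a plain string.
    """
    key, *values = line.rstrip("\n").split("\t")
    values = [v for v in values if v]  # drop empty cells
    if key in MULTI_VALUE_KEYS:
        return key, values
    return key, values[0] if values else ""


key, value = parse_line("Study PubMed ID\t25024206\t28327978\n")
print(key, value)
```

Always returning a list for multi-value keys means a consumer does not need to special-case studies that happen to have a single publication or organism.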
With the last commit, running the parser + OMERO formatter in check-mode on test104
[sbesson@test104-omeroreadwrite idr-utils]$ find /uod/idr/metadata/ -maxdepth 2 -iname *study.txt -exec sh -c "echo {} && /opt/omero/server/venv3/bin/python pyidr/study_parser.py {} --check -q" \;
/uod/idr/metadata/idr0001-graml-sysgro/idr0001-study.txt
/uod/idr/metadata/idr0002-heriche-condensation/idr0002-study.txt
/uod/idr/metadata/idr0003-breker-plasticity/idr0003-study.txt
/uod/idr/metadata/idr0004-thorpe-rad52/idr0004-study.txt
/uod/idr/metadata/idr0006-fong-nuclearbodies/idr0006-study.txt
/uod/idr/metadata/idr0007-srikumar-sumo/idr0007-study.txt
/uod/idr/metadata/idr0008-rohn-actinome/idr0008-study.txt
/uod/idr/metadata/idr0009-simpson-secretion/idr0009-study.txt
/uod/idr/metadata/idr0010-doil-dnadamage/idr0010-study.txt
/uod/idr/metadata/idr0011-ledesmafernandez-dad4/idr0011-study.txt
/uod/idr/metadata/idr0012-fuchs-cellmorph/idr0012-study.txt
/uod/idr/metadata/idr0013-neumann-mitocheck/idr0013-study.txt
/uod/idr/metadata/idr0015-colin-taraoceans/idr0015-study.txt
ERROR:pyidr.study_parser.Formatter:Mismatching annotation
/uod/idr/metadata/idr0016-wawer-bioactivecompoundprofiling/idr0016-study.txt
/uod/idr/metadata/idr0017-breinig-drugscreen/idr0017-study.txt
/uod/idr/metadata/idr0018-neff-histopathology/idr0018-study.txt
/uod/idr/metadata/idr0019-sero-nfkappab/idr0019-study.txt
ERROR:pyidr.study_parser.Formatter:Mismatching annotation
/uod/idr/metadata/idr0020-barr-chtog/idr0020-study.txt
/uod/idr/metadata/idr0021-lawo-pericentriolarmaterial/idr0021-study.txt
/uod/idr/metadata/idr0022-koedoot-cellmigration/idr0022-study.txt
/uod/idr/metadata/idr0023-szymborska-nuclearpore/idr0023-study.txt
/uod/idr/metadata/idr0025-stadler-proteinatlas/idr0025-study.txt
/uod/idr/metadata/idr0026-weigelin-immunotherapy/idr0026-study.txt
/uod/idr/metadata/idr0027-dickerson-chromatin/idr0027-study.txt
/uod/idr/metadata/idr0028-pascualvargas-rhogtpases/idr0028-study.txt
/uod/idr/metadata/idr0030-sero-yap/idr0030-study.txt
/uod/idr/metadata/idr0032-yang-meristem/idr0032-study.txt
/uod/idr/metadata/idr0033-rohban-pathways/idr0033-study.txt
/uod/idr/metadata/idr0034-kilpinen-hipsci/idr0034-study.txt
/uod/idr/metadata/idr0035-caie-drugresponse/idr0035-study.txt
/uod/idr/metadata/idr0036-gustafsdottir-cellpainting/idr0036-study.txt
/uod/idr/metadata/idr0037-vigilante-hipsci/idr0037-study.txt
/uod/idr/metadata/idr0038-held-kidneylightsheet/idr0038-study.txt
/uod/idr/metadata/idr0040-aymoz-singlecell/idr0040-study.txt
/uod/idr/metadata/idr0041-cai-mitoticatlas/idr0041-study.txt
/uod/idr/metadata/idr0042-nirschl-wsideeplearning/idr0042-study.txt
/uod/idr/metadata/idr0043-uhlen-humanproteinatlas/idr0043-study.txt
/uod/idr/metadata/idr0044-mcdole-tardislightsheet/idr0044-study.txt
/uod/idr/metadata/idr0045-reichmann-zygotespindle/idr0045-study.txt
/uod/idr/metadata/idr0047-neuert-yeastmrna/idr0047-study.txt
/uod/idr/metadata/idr0048-abdeladim-chroms/idr0048-study.txt
/uod/idr/metadata/idr0050-springer-cytoskeletalsystems/idr0050-study.txt
/uod/idr/metadata/idr0051-fulton-tailbudlightsheet/idr0051-study.txt
/uod/idr/metadata/idr0052-walther-condensinmap/idr0052-study.txt
/uod/idr/metadata/idr0053-faas-virtualnanoscopy/idr0053-study.txt
/uod/idr/metadata/idr0054-segura-tonsilhyperion/idr0054-study.txt
/uod/idr/metadata/idr0056-stojic-lncrnas/idr0056-study.txt
/uod/idr/metadata/idr0061-wolf-spindlepositioning/idr0061-study.txt
/uod/idr/metadata/idr0062-blin-nuclearsegmentation/idr0062-study.txt
/uod/idr/metadata/idr0063-newman-chromosomalloci/idr0063-study.txt
/uod/idr/metadata/idr0064-goglia-erkdynamics/idr0064-study.txt
/uod/idr/metadata/idr0065-camsund-crispri/idr0065-study.txt
/uod/idr/metadata/idr0066-voigt-mesospim/idr0066-study.txt
/uod/idr/metadata/idr0067-king-yeastmeiosis/idr0067-study.txt
/uod/idr/metadata/idr0069-caldera-perturbome/idr0069-study.txt
/uod/idr/metadata/idr0070-kerwin-hdbr/idr0070-study.txt
/uod/idr/metadata/idr0071-feldman-crisprko/idr0071-study.txt
/uod/idr/metadata/idr0072-schormann-subcellref/idr0072-study.txt
/uod/idr/metadata/idr0073-schaadt-immuneinfiltrates/idr0073-study.txt
/uod/idr/metadata/idr0075-cabirol-honeybee/idr0075-study.txt
/uod/idr/metadata/idr0076-ali-metabric/idr0076-study.txt
/uod/idr/metadata/idr0077-valuchova-flowerlightsheet/idr0077-study.txt
/uod/idr/metadata/idr0078-mattiazziusaj-endocyticcomp/idr0078-study.txt
/uod/idr/metadata/idr0079-hartmann-lateralline/idr0079-study.txt
/uod/idr/metadata/idr0080-way-perturbation/idr0080-study.txt
/uod/idr/metadata/idr0081-georgi-adenovirus/idr0081-study.txt
/uod/idr/metadata/idr0082-pennycuick-lesions/idr0082-study.txt
/uod/idr/metadata/idr0083-lamers-sarscov2/idr0083-study.txt
/uod/idr/metadata/idr0084-oudelaar-alphaglobin/idr0084-study.txt
/uod/idr/metadata/idr0085-walsh-mfhrem/idr0085-study.txt
/uod/idr/metadata/idr0086-miron-micrographs/idr0086-study.txt
/uod/idr/metadata/idr0087-paci-nuclearimport/idr0087-study.txt
/uod/idr/metadata/idr0088-cox-phenomicprofiling/idr0088-study.txt
/uod/idr/metadata/idr0089-fischl-coldtemp/idr0089-study.txt
/uod/idr/metadata/idr0090-ashdown-malaria/idr0090-study.txt
/uod/idr/metadata/idr0091-julou-lacinduction/idr0091-study.txt
/uod/idr/metadata/idr0092-ostrop-organoid/idr0092-study.txt
/uod/idr/metadata/idr0093-mueller-perturbation/idr0093-study.txt
/uod/idr/metadata/idr0094-ellinger-sarscov2/idr0094-study.txt
/uod/idr/metadata/idr0095-ali-asymmetry/idr0095-study.txt
/uod/idr/metadata/idr0097-reicher-proteintag/idr0097-study.txt
/uod/idr/metadata/idr0098-huang-octmos/idr0098-study.txt
/uod/idr/metadata/idr0099-jain-beetlelightsheet/idr0099-study.txt
/uod/idr/metadata/idr0100-capar-myelin/idr0100-study.txt
/uod/idr/metadata/idr0103-coomer-hiv1fusion/idr0103-study.txt
/uod/idr/metadata/idr0106-kubota-lunglightsheet/idr0106-study.txt
/uod/idr/metadata/idr0109-zaritsky-melanoma/idr0109-study.txt
/uod/idr/metadata/idr0114-lindsay-hdbr/idr0114-study.txt
/uod/idr/metadata/idr0107-morgan-hei10/idr0107-study.txt
/uod/idr/metadata/idr0111-lee-cellmigration/idr0111-study.txt
/uod/idr/metadata/idr0110-rodermund-xistrna/idr0110-study.txt
/uod/idr/metadata/idr0113-bottes-opcclones/idr0113-study.txt
/uod/idr/metadata/idr0108-sabinina-nuclearporecomplex/idr0108-study.txt
/uod/idr/metadata/idr0117-croce-marimba/idr0117-study.txt
/uod/idr/metadata/idr0112-verzat-motorneurons/idr0112-study.txt
/uod/idr/metadata/idr0101-payne-insitugenomeseq/idr0101-study.txt
/uod/idr/metadata/idr0116-deboer-npod/idr0116-study.txt
/uod/idr/metadata/idr0005-toret-adhesion/idr0005-study.txt
/uod/idr/metadata/idr0118-keenan-flylightsheet/idr0118-study.txt
/uod/idr/metadata/idr0096-tratwal-marrowquant/idr0096-study.txt
/uod/idr/metadata/idr0068-shah-zebrafishlightsheet/idr0068-study.txt
/uod/idr/metadata/idr0124-esteban-heartmorphogenesis/idr0124-study.txt
The two mismatching annotations are related to undefined Screen Type values in idr0015 and idr0019, which are currently generated as keys with empty values in the map annotation.
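To show why a key with an empty value is enough to trigger a mismatch, here is a minimal sketch of a key-value comparison. It is not the check implemented in pyidr: `diff_pairs` and the sample pairs are hypothetical, and the sketch makes no assumption about which side (production DB or freshly generated annotation) carries the empty `Screen Type` key.

```python
def diff_pairs(left, right):
    """Return the (key, value) pairs present on one side only."""
    only_left = [pair for pair in left if pair not in right]
    only_right = [pair for pair in right if pair not in left]
    return only_left, only_right


# Hypothetical data: one side carries an empty Screen Type, the other omits it.
side_a = [("Imaging Method", "fluorescence microscopy"), ("Screen Type", "")]
side_b = [("Imaging Method", "fluorescence microscopy")]

only_a, only_b = diff_pairs(side_a, side_b)
if only_a or only_b:
    print("Mismatching annotation:", only_a, only_b)
```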
The workflow artifacts in https://github.com/IDR/idr-metadata/actions/runs/1745648670 contain the JSON representation of all study files.
As a lot of refactoring already happened and the annotation is largely unmodified barring the Screen Type element discussed above, proposing to evaluate these changes as they stand. Follow-up PRs can then focus on refining the JSON formatting of the study files.
Looks good.
It would be nice if "Study Author List" was actually a list with 1 author per item.
All studies have just a single multi-author item in the list, except idr0016 which has 2 items in the list.
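A minimal sketch of the suggested one-author-per-item split, assuming author names never contain a comma themselves (as in the idr0016 example, where names take the form "Surname Initials"). `split_authors` and `split_author_lists` are hypothetical helpers, not part of the parser.

```python
def split_authors(author_list):
    """Split a comma-separated author string into one name per item."""
    return [name.strip() for name in author_list.split(",") if name.strip()]


def split_author_lists(study_author_list):
    """Apply split_authors to each publication's author string.

    The result is a list of lists: one inner list per publication.
    """
    return [split_authors(item) for item in study_author_list]


print(split_authors("Wawer MJ, Li K, Gustafsdottir SM"))
```

Keeping one inner list per publication preserves which authors belong to which paper for studies such as idr0016 that list more than one publication.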
Then we add these as JSON attachments to each study (one or all of the Projects/Screens named idr0NNN)?
And then what's the API used to retrieve them?
An IDR-specific URL (in what web app?) e.g. /gallery/idr/study/0123/json/ or something generic like /webclient/file_annotation/?name=idr0123.json?
All studies have just a single multi-author item in the list, except idr0016 which has 2 items in the list.
idr0035 has 5 associated publications, and hence 5 separate author lists.
Thinking on how to turn this into Linked Data, the Author concept might be a good one as it can be mapped to existing types like schema.org/Person. However, we might need to differentiate two author concepts:
- Study Author List is related to the scientific publication(s) associated with the dataset
- Study Person is the list of contacts associated with the dataset
In terms of metadata, the latter offers more advantages and contains more information.
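As a rough illustration of the Study Person mapping, the parallel lists in the JSON could be zipped into one record per contact. This is only a sketch: `PERSON_FIELDS` and `study_persons` are hypothetical, and the choice of the schema.org properties `givenName`, `familyName`, `email` and `address` is an assumption about how such a mapping might look, not an agreed design.

```python
# Hypothetical mapping from study JSON keys to schema.org Person properties.
PERSON_FIELDS = {
    "Study Person First Name": "givenName",
    "Study Person Last Name": "familyName",
    "Study Person Email": "email",
    "Study Person Address": "address",
}


def study_persons(study):
    """Combine the parallel Study Person lists into one dict per contact."""
    columns = [study.get(key, []) for key in PERSON_FIELDS]
    persons = []
    # zip stops at the shortest list, so a missing key yields no persons.
    for values in zip(*columns):
        person = {"@type": "Person"}
        person.update(zip(PERSON_FIELDS.values(), values))
        persons.append(person)
    return persons


study = {
    "Study Person First Name": ["Anne", "Paul"],
    "Study Person Last Name": ["Carpenter", "Clemons"],
    "Study Person Email": ["anne@broadinstitute.org", "pclemons@broadinstitute.org."],
    "Study Person Address": [
        "Broad Institute of Harvard and MIT, Cambridge, MA, USA",
        "Broad Institute of Harvard and MIT, Cambridge, MA, USA",
    ],
}
for person in study_persons(study):
    print(person)
```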
Then we add these as JSON attachments to each study (one or all of the Projects/Screens named idr0NNN)?
Attaching would be easy but as you mention we might want to define how people should consume these. I was initially trying to build a JSON representation that would contain all necessary information to generate study landing pages.
Coming back to it (arguably late), I am in two minds about this feature. Looking at the files diff, it is substantial, and although the description tries to capture the set of changes, I would need to review/retest it more closely to be happy for this script to be used in production.
As indicated in the PR body, the context of this investigation was the generation of study landing pages. Since then, we decided to focus our efforts on the development of the search engine/UI. So there are obviously some concerns about pushing a substantial rewrite without a clear consumer or more people looking at it.
If we agree it is safer not to merge this in, options are either to keep this open until we come back to it or close and convert into an issue.
As part of the ongoing investigation around landing pages, this PR reviews the study_parser utility which contains most of the logic to parse the study file and generate the OMERO annotations at the Project/Screen level. The main limitation at the moment is that the outcome of the --report option is very tailored to the way the OMERO annotations are constructed.
To address this, this PR focuses on separating the file parsing logic from the formatting logic to allow us to create different representations:
- Parser class now generates an internal representation of the study as close as possible to the content of the study file, with known keys defined in KEYS parsed, validated when applicable and stored either under the study dictionary or the components list (ordered list of screens/experiments).
- Study Accession, Study Name, Experiment Name and Screen Name keys are derived from internal keys named Comment.*
- Study Publication Title values are now split in the parser as lists of strings and the downstream method adjusted accordingly
- Formatter class is renamed as OMEROFormatter and all the OMERO-specific logic is migrated there
- JSONFormatter class is added which creates a minimal JSON representation of a study. Initially it contains all the Study keys at the top level and two elements, Experiments and Screens, which contain each of the study components
- --report option is adjusted to use the JSONFormatter. A new --report-directory option allows writing the output of the formatter under <out_dir>/<accession>.json - see the artifacts https://github.com/IDR/idr-metadata/actions/runs/1745648670
- --check/--set options are unchanged and still work on OMERO objects using the OMEROFormatter
Potential next steps:
Protocols,
Feature...`)
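The separation between parsing and formatting described in this PR could look roughly like the sketch below. It is a hypothetical illustration and not the pyidr class of the same name: it assumes the parser exposes a `study` dictionary and a `components` list, that screens and experiments can be told apart by their `Screen Name` or `Experiment Name` key, and that sorted keys with two-space indentation are acceptable output.

```python
import json
from pathlib import Path


class JSONFormatter:
    """Hypothetical sketch of a formatter that only depends on parsed data."""

    def __init__(self, study, components):
        self.study = study            # top-level Study keys
        self.components = components  # ordered list of screens/experiments

    def to_dict(self):
        """Build the JSON-ready document with Experiments and Screens."""
        doc = dict(self.study)
        doc["Experiments"] = [c for c in self.components if "Experiment Name" in c]
        doc["Screens"] = [c for c in self.components if "Screen Name" in c]
        return doc

    def dumps(self):
        return json.dumps(self.to_dict(), indent=2, sort_keys=True)

    def write(self, out_dir):
        """Write <out_dir>/<accession>.json and return its path."""
        path = Path(out_dir) / f"{self.study['Study Accession']}.json"
        path.write_text(self.dumps() + "\n", encoding="utf-8")
        return path


formatter = JSONFormatter(
    {"Study Accession": "idr0016", "Study License": "CC0 1.0"},
    [{"Screen Name": "idr0016-wawer-bioactivecompoundprofiling/screenA"}],
)
print(formatter.dumps())
```

Because the formatter receives plain dictionaries and lists, an OMERO-specific formatter could consume the same parsed data without either class knowing about the other.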