AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

List of Fields for Experiment and Sample Details #140

Closed dvenprasad closed 6 years ago

dvenprasad commented 6 years ago

To keep the prototype I'm building consistent with the data available/extractable in the back end, here are the experiment fields (unordered):

@Miserlou @kurtwheeler @jaclyn-taroni:

@jaclyn-taroni: Are there any other fields which would be useful to include?

Miserlou commented 6 years ago

I have compiled a sorted count of every sample attribute available on ArrayExpress:

https://gist.github.com/Miserlou/f10955b5414f2dbf45277d73156233cd

I think we can reasonably normalize Age, Gender and Disease. I don't know if Disease is one you want here, but it's one of the more doable ones and could be useful.

dvenprasad commented 6 years ago

Yes, disease would be very useful. Is it also possible to get the rest of the non-normalizable, submitter-provided metadata as key-value pairs?
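A minimal sketch of the split discussed here: a few categories normalized into canonical fields, everything else kept as raw key-value pairs. It assumes sample attributes arrive as a list of category/value dicts, as in ArrayExpress experiment responses. The `CANONICAL_CATEGORIES` mapping and the function name are hypothetical, not part of refine.bio.

```python
# Hypothetical synonym table: submitter-supplied category -> canonical field.
# The real list of synonyms would have to come from the attribute counts.
CANONICAL_CATEGORIES = {
    "age": "age",
    "sex": "sex",
    "gender": "sex",
    "disease": "disease",
    "disease state": "disease",
}


def split_sample_attributes(sample_attributes):
    """Return (normalized, extra) dicts from a list of category/value pairs."""
    normalized = {}
    extra = {}
    for attribute in sample_attributes:
        category = attribute["category"]
        value = attribute["value"]
        canonical = CANONICAL_CATEGORIES.get(category.strip().lower())
        if canonical is not None:
            normalized[canonical] = value
        else:
            # Anything we cannot normalize is passed through untouched.
            extra[category] = value
    return normalized, extra


if __name__ == "__main__":
    attributes = [
        {"category": "cell line", "value": "621-101"},
        {"category": "disease", "value": "Angiomyolipoma"},
        {"category": "organism", "value": "Homo sapiens"},
    ]
    print(split_sample_attributes(attributes))
```

Keeping the leftovers in a separate dict means the front end can show them as-is without the back end having to understand them.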

Miserlou commented 6 years ago

Here's the total (unsorted) usage of all key values on AE:

{u'accession': 47976,
 u'anonymousreview': 220,
 u'arraydesign': 40528,
 u'arraydesign__accession': 7173,
 u'arraydesign__count': 7173,
 u'arraydesign__id': 7173,
 u'arraydesign__legacy_id': 5278,
 u'arraydesign__name': 7173,
 u'arraydesign_accession': 37705,
 u'arraydesign_count': 37705,
 u'arraydesign_id': 37705,
 u'arraydesign_legacy_id': 29157,
 u'arraydesign_name': 37705,
 u'assays': 47976,
 u'bibliography': 21484,
 u'bibliography__accession': 2779,
 u'bibliography__authors': 3563,
 u'bibliography__doi': 2232,
 u'bibliography__issue': 1811,
 u'bibliography__pages': 1934,
 u'bibliography__publication': 851,
 u'bibliography__publisher': 2,
 u'bibliography__status': 793,
 u'bibliography__title': 3568,
 u'bibliography__uri': 348,
 u'bibliography__volume': 673,
 u'bibliography__year': 2098,
 u'bibliography_accession': 17871,
 u'bibliography_authors': 18725,
 u'bibliography_doi': 11387,
 u'bibliography_edition': 2,
 u'bibliography_issue': 2376,
 u'bibliography_pages': 2800,
 u'bibliography_publication': 3065,
 u'bibliography_publisher': 18,
 u'bibliography_status': 2849,
 u'bibliography_title': 19032,
 u'bibliography_uri': 990,
 u'bibliography_volume': 2552,
 u'bibliography_year': 3297,
 u'bioassaydatagroup': 47976,
 u'bioassaydatagroup__arraydesignprovider': 120359,
 u'bioassaydatagroup__bioassaydatacubes': 120359,
 u'bioassaydatagroup__bioassays': 120359,
 u'bioassaydatagroup__dataformat': 120359,
 u'bioassaydatagroup__id': 120359,
 u'bioassaydatagroup__isderived': 120359,
 u'bioassaydatagroup__name': 120359,
 u'bioassaydatagroup_arraydesignprovider': 5862,
 u'bioassaydatagroup_bioassaydatacubes': 5862,
 u'bioassaydatagroup_bioassays': 5862,
 u'bioassaydatagroup_dataformat': 5862,
 u'bioassaydatagroup_id': 5862,
 u'bioassaydatagroup_isderived': 5862,
 u'bioassaydatagroup_name': 5862,
 u'description': 47976,
 u'description_id': 47976,
 u'description_text': 47976,
 u'description_text__a': 1788,
 u'description_text__a_$': 1785,
 u'description_text__a__': 1,
 u'description_text__a__blank': 1,
 u'description_text__a_bugs.sgul.ac.uk': 1,
 u'description_text__a_e-bugs-129': 1,
 u'description_text__a_href': 1788,
 u'description_text__a_target': 1776,
 u'description_text__br': 1400,
 u'description_text__i': 48,
 u'description_text__i__i': 1,
 u'experimentalfactor': 39946,
 u'experimentalfactor__name': 64800,
 u'experimentalfactor__value': 64800,
 u'experimentalfactor_name': 18223,
 u'experimentalfactor_value': 18223,
 u'experimentdesign': 12373,
 u'experimenttype': 47976,
 u'files': 47976,
 u'files_': 47976,
 u'files_biosamples': 784,
 u'files_biosamples_png': 784,
 u'files_biosamples_png_name': 784,
 u'files_biosamples_svg': 784,
 u'files_biosamples_svg_name': 784,
 u'files_fgem': 40543,
 u'files_fgem_available': 40543,
 u'files_fgem_name': 40543,
 u'files_idf': 47976,
 u'files_idf_name': 47976,
 u'files_raw': 40650,
 u'files_raw_celcount': 40650,
 u'files_raw_count': 40650,
 u'files_raw_name': 40650,
 u'files_sdrf': 47976,
 u'files_sdrf_name': 47976,
 u'id': 47976,
 u'lastupdatedate': 47573,
 u'miamescores': 40586,
 u'miamescores_derivedbioassaydatascore': 40586,
 u'miamescores_factorvaluescore': 40586,
 u'miamescores_measuredbioassaydatascore': 40586,
 u'miamescores_overallscore': 40586,
 u'miamescores_protocolscore': 40586,
 u'miamescores_reportersequencescore': 40586,
 u'minseqescores': 8072,
 u'minseqescores_derivedbioassaydatascore': 8072,
 u'minseqescores_experimentdesignscore': 8072,
 u'minseqescores_factorvaluescore': 8072,
 u'minseqescores_measuredbioassaydatascore': 8072,
 u'minseqescores_overallscore': 8072,
 u'minseqescores_protocolscore': 8072,
 u'name': 47976,
 u'organism': 47974,
 u'processeddatafiles': 47976,
 u'processeddatafiles_available': 47976,
 u'protocol': 47887,
 u'protocol__accession': 349975,
 u'protocol__id': 349975,
 u'protocol_accession': 128,
 u'protocol_id': 128,
 u'provider': 47970,
 u'provider__contact': 188399,
 u'provider__email': 188399,
 u'provider__role': 188399,
 u'provider_contact': 13537,
 u'provider_email': 13537,
 u'provider_role': 13537,
 u'rawdatafiles': 47976,
 u'rawdatafiles_available': 47976,
 u'relatedexperiment': 204,
 u'releasedate': 47976,
 u'sampleattribute': 47975,
 u'sampleattribute__category': 201628,
 u'sampleattribute__value': 201628,
 u'sampleattribute_category': 3593,
 u'sampleattribute_value': 3593,
 u'samples': 47976,
 u'secondaryaccession': 40269,
 u'seqdatauri': 7427,
 u'species': 47974,
 u'submissiondate': 8671}
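
For reference, a tally like the one above could be produced roughly as follows. This is a sketch, not the script that was actually run: the naming convention (a single underscore for a nested dict, a double underscore for items of a list, one count per occurrence) is inferred from the key names and counts in the output, and `count_keys` is a hypothetical name.

```python
from collections import Counter


def count_keys(node, prefix="", counts=None):
    """Count every key path in a nested experiment record.

    Nested dicts extend the path with "_"; dicts inside a list extend it
    with "__" and are counted once per list item.
    """
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        for key, value in node.items():
            path = prefix + key
            counts[path] += 1
            if isinstance(value, dict):
                count_keys(value, path + "_", counts)
            elif isinstance(value, list):
                for item in value:
                    # Non-dict items (plain strings, numbers) add nothing.
                    count_keys(item, path + "__", counts)
    return counts


if __name__ == "__main__":
    experiment = {
        "accession": "E-MTAB-3684",
        "arraydesign": {"accession": "A-AGIL-28"},
        "protocol": [
            {"accession": "P-MTAB-45363", "id": 1139880},
            {"accession": "P-MTAB-45364", "id": 1139879},
        ],
    }
    totals = Counter()
    for record in [experiment]:  # in practice, every scraped experiment
        count_keys(record, counts=totals)
    for key in sorted(totals):
        print(key, totals[key])
```

Under this convention, a list-valued field such as `protocol` is counted once per experiment while `protocol__accession` is counted once per protocol, which would explain why some `__` counts exceed the number of experiments.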

And a random Experiment response example:

{u'accession': u'E-MTAB-3684',
 u'arraydesign': {u'accession': u'A-AGIL-28',
                  u'count': 12,
                  u'id': 11853,
                  u'legacy_id': 1316855037,
                  u'name': u'Agilent Whole Human Genome Microarray 4x44K 014850 G4112F (85 cols x 532 rows)'},
 u'assays': 12,
 u'bioassaydatagroup': {u'arraydesignprovider': None,
                        u'bioassaydatacubes': 12,
                        u'bioassays': 12,
                        u'dataformat': u'rawData',
                        u'id': None,
                        u'isderived': 0,
                        u'name': u'rawData'},
 u'description': {u'id': None,
                  u'text': u'621-101 cells were treated with rapamycin and rhebsiRna and samples were run in triplicates for each condition.'},
 u'experimentalfactor': {u'name': u'compound',
                         u'value': [u'control siRNA',
                                    u'DMSO',
                                    u'rapamycin',
                                    u'Rheb siRNA']},
 u'experimenttype': u'transcription profiling by array',
 u'files': {u'': None,
            u'idf': {u'name': u'E-MTAB-3684.idf.txt'},
            u'raw': {u'celcount': 0,
                     u'count': True,
                     u'name': u'E-MTAB-3684.raw.1.zip'},
            u'sdrf': {u'name': u'E-MTAB-3684.sdrf.txt'}},
 u'id': 524913,
 u'lastupdatedate': u'2017-03-16',
 u'miamescores': {u'derivedbioassaydatascore': 0,
                  u'factorvaluescore': 1,
                  u'measuredbioassaydatascore': 1,
                  u'overallscore': 3,
                  u'protocolscore': 0,
                  u'reportersequencescore': 1},
 u'name': u'Gene expression profiling of 621-101 cells with rapamycin treatment and Rheb downregulation',
 u'organism': u'Homo sapiens',
 u'processeddatafiles': {u'available': False},
 u'protocol': [{u'accession': u'P-MTAB-45363', u'id': 1139880},
               {u'accession': u'P-MTAB-45364', u'id': 1139879},
               {u'accession': u'P-MTAB-45365', u'id': 1139882},
               {u'accession': u'P-MTAB-45366', u'id': 1139881},
               {u'accession': u'P-MTAB-45361', u'id': 1139883},
               {u'accession': u'P-MTAB-45362', u'id': 1139884}],
 u'provider': [{u'contact': u'MAGDALENA KARBOWNICZEK',
                u'email': u'magdalena.karbowniczek@ttuhsc.edu',
                u'role': u'investigator'},
               {u'contact': u'Elizabeth Henske',
                u'email': None,
                u'role': u'investigator'},
               {u'contact': u'SASIKANTH MANNE',
                u'email': u'sasikanthmanne@gmail.com',
                u'role': u'submitter'}],
 u'rawdatafiles': {u'available': True},
 u'releasedate': u'2018-01-01',
 u'sampleattribute': [{u'category': u'cell line', u'value': u'621-101'},
                      {u'category': u'disease', u'value': u'Angiomyolipoma'},
                      {u'category': u'organism', u'value': u'Homo sapiens'}],
 u'samples': 12,
 u'species': u'Homo sapiens',
 u'submissiondate': u'2009-04-27'}

There is no institution information other than what we can deduce from the email addresses, but email addresses aren't always present.
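
As a rough illustration of how little the email addresses give us, here is a sketch that pulls the domains out of a provider block. The function name is hypothetical, and the local parts of the addresses are replaced with placeholders; only the domains match the example above.

```python
def provider_email_domains(providers):
    """Return the sorted, de-duplicated email domains of an experiment's providers."""
    # The key counts suggest 'provider' is sometimes a single dict, not a list.
    if isinstance(providers, dict):
        providers = [providers]
    domains = set()
    for provider in providers:
        email = provider.get("email")
        if email and "@" in email:
            domains.add(email.rsplit("@", 1)[1].lower())
    return sorted(domains)


if __name__ == "__main__":
    providers = [
        {"contact": "Investigator A", "email": "a@ttuhsc.edu", "role": "investigator"},
        {"contact": "Investigator B", "email": None, "role": "investigator"},
        {"contact": "Submitter C", "email": "c@gmail.com", "role": "submitter"},
    ]
    print(provider_email_domains(providers))  # ['gmail.com', 'ttuhsc.edu']
```

Even in this one example, one address is missing and one is a generic webmail domain, so a domain alone does not reliably identify an institution.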

jaclyn-taroni commented 6 years ago

If I look at the "Investigation description" file for E-MTAB-3684 from the web UI, the "Submitter Institution" piece of information I would be interested in is the "Person Affiliation" field. Here it's Texas Tech University Health Sciences Center. Is it possible to retrieve that information programmatically? Depending on how the filtering is performed (autocomplete?), I think filtering on this may be possible (#139).

Miserlou commented 6 years ago

Oh wow, of course there is. Yeah, that's no problem if every experiment has one of those files, which I think they generally do.
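
A minimal sketch of reading that field, assuming the IDF is the usual MAGE-TAB layout: one field per line, with the field name in the first tab-separated column and its values in the following columns. The function names are hypothetical, and the sample text is an invented excerpt (only the affiliation comes from this thread), not the real file contents.

```python
def parse_idf(text):
    """Parse IDF text into {field name: [values]}, keeping column positions."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, *values = line.split("\t")
        # Empty cells are kept so per-person columns stay aligned.
        fields[tag.strip()] = [value.strip() for value in values]
    return fields


def person_affiliations(fields):
    """Return the distinct, non-empty 'Person Affiliation' values in order."""
    seen = []
    for value in fields.get("Person Affiliation", []):
        if value and value not in seen:
            seen.append(value)
    return seen


if __name__ == "__main__":
    sample = (
        "Investigation Title\tExample title\n"
        "Person Last Name\tDoe\tRoe\n"
        "Person Affiliation\tTexas Tech University Health Sciences Center\t\n"
    )
    print(person_affiliations(parse_idf(sample)))
```

An experiment without an IDF file, or with an empty affiliation column, simply yields an empty list here, so the field would need to be treated as optional.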

cgreene commented 6 years ago

Also if those files have the same contents. @Miserlou, would it be possible to summarize the completeness of fields in those files like you did with ArrayExpress?

jaclyn-taroni commented 6 years ago

@Miserlou information-only question: are these key-value pairs on an experiment accession basis? Is it then correct that there are 47976 experiments in ArrayExpress?

Miserlou commented 6 years ago

I just modified Casey's old script, which had a set of JSON files per year, so maybe some of 2016/7/8(?) are missing. The AE homepage says they have 70771 experiments, not 47976, but maybe there are 22000 in the database which don't have any accessions with them. Either way, it's not really meant to be exhaustive, but to show that it's a very steep curve and probably doable for some common-ish fields.

jaclyn-taroni commented 6 years ago

Got it, thanks! I think if we snag the information in the .idf.txt files as well (which may not be the best way to go about it/the only place that information is available), we'll increase the number of fields 👍

I believe this is related to #95

jaclyn-taroni commented 6 years ago
  • Wet lab protocols

For this one, I think the better way to think about this is "submitter-supplied protocols." In keeping with our E-MTAB-3684 example, it would be the information here: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3684/protocols/

I would not necessarily display that upfront if we've processed the data (rather than "NO-OP"), but it should be easily accessible to the user.

If we've processed the data (is_ccdl if I'm not mistaken?), I would imagine the most important information to report upfront is what we've done to it.

jaclyn-taroni commented 6 years ago
  • List of samples

I'll note that I think the number of samples should probably be reported in the search results.

Miserlou commented 6 years ago

[For myself] Example with publication: https://www.ebi.ac.uk/arrayexpress/files/E-GEOD-26114/E-GEOD-26114.idf.txt

Has fields: Publication Title
Publication Author List
PubMed ID
Publication DOI

jaclyn-taroni commented 6 years ago

@dvenprasad let me know if I'm blocking you on this issue or #139! Those fields seem pretty reasonable to me at this stage. We can see if folks ask for anything else.

Miserlou commented 6 years ago

These are all the IDF file key values. This took a long time since it was a total scrape, not using prescraped data.

https://gist.github.com/Miserlou/daaf0922cb26989b9049b79615e537d5
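
To turn raw key counts like these into the completeness summary asked about earlier in the thread, one option is to report the fraction of files in which each field has at least one non-empty value. This is only a sketch: it assumes each IDF has already been parsed into a dict of field name to list of values, `field_completeness` is a hypothetical name, and the records below are made up.

```python
from collections import Counter


def field_completeness(parsed_idfs):
    """Return {field name: fraction of files where the field is non-empty}."""
    present = Counter()
    for fields in parsed_idfs:
        for name, values in fields.items():
            if any(value.strip() for value in values):
                present[name] += 1
    total = len(parsed_idfs)
    if total == 0:
        return {}
    return {name: count / total for name, count in present.items()}


if __name__ == "__main__":
    # Invented records for illustration only.
    parsed = [
        {"Publication Title": ["Title A"], "PubMed ID": [""]},
        {"Publication Title": ["Title B"], "PubMed ID": ["0000"],
         "Publication DOI": ["10.0000/example"]},
    ]
    for name, fraction in sorted(field_completeness(parsed).items()):
        print(f"{name}: {fraction:.0%}")
```

A field that appears in a file but is blank counts as missing, which is usually what matters when deciding whether it is worth displaying.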

jaclyn-taroni commented 6 years ago

@dvenprasad can you take a look at this and give us an update about what would be required to close this/when that might happen?

Miserlou commented 6 years ago

This has all been taken care of by the files attached to this ticket and the work in https://github.com/AlexsLemonade/refinebio/issues/165.