Closed dvenprasad closed 6 years ago
I have compiled a sorted count of every sample attribute available on ArrayExpress:
https://gist.github.com/Miserlou/f10955b5414f2dbf45277d73156233cd
I think we can reasonably normalize Age, Gender, and Disease. I don't know if Disease is one you want here, but it's one of the more doable ones and could be useful.
Yes, disease would be very useful. Is it also possible to get the rest of the non-normalizable, submitter-provided metadata as key-value pairs?
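To make the split between normalized fields and leftover key-value pairs concrete, here is a minimal sketch. The function name, the `NORMALIZED_CATEGORIES` mapping, and its synonyms are hypothetical and are not taken from the gist. The input mirrors the `sampleattribute` list of the E-MTAB-3684 response quoted in this thread.

```python
# Hypothetical mapping from submitter-supplied category names to the
# normalized fields discussed above. The synonyms are illustrative only.
NORMALIZED_CATEGORIES = {
    "age": "age",
    "gender": "gender",
    "sex": "gender",
    "disease": "disease",
    "disease state": "disease",
}


def split_sample_attributes(attributes):
    """Split attribute dicts into normalized fields and leftover key-value pairs."""
    normalized = {}
    extra = {}
    for attr in attributes:
        category = (attr.get("category") or "").strip()
        value = attr.get("value")
        field = NORMALIZED_CATEGORIES.get(category.lower())
        if field is not None:
            normalized[field] = value
        elif category:
            # Anything we cannot normalize is kept verbatim as a key-value pair.
            extra[category] = value
    return normalized, extra


sample_attributes = [
    {"category": "cell line", "value": "621-101"},
    {"category": "disease", "value": "Angiomyolipoma"},
    {"category": "organism", "value": "Homo sapiens"},
]
normalized, extra = split_sample_attributes(sample_attributes)
print(normalized)
print(extra)
```

Keeping the leftovers under their original category names means nothing the submitter provided is lost, even if it never gets a normalized field.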
Here's the total (unsorted) usage of all key values on AE:
{u'accession': 47976,
u'anonymousreview': 220,
u'arraydesign': 40528,
u'arraydesign__accession': 7173,
u'arraydesign__count': 7173,
u'arraydesign__id': 7173,
u'arraydesign__legacy_id': 5278,
u'arraydesign__name': 7173,
u'arraydesign_accession': 37705,
u'arraydesign_count': 37705,
u'arraydesign_id': 37705,
u'arraydesign_legacy_id': 29157,
u'arraydesign_name': 37705,
u'assays': 47976,
u'bibliography': 21484,
u'bibliography__accession': 2779,
u'bibliography__authors': 3563,
u'bibliography__doi': 2232,
u'bibliography__issue': 1811,
u'bibliography__pages': 1934,
u'bibliography__publication': 851,
u'bibliography__publisher': 2,
u'bibliography__status': 793,
u'bibliography__title': 3568,
u'bibliography__uri': 348,
u'bibliography__volume': 673,
u'bibliography__year': 2098,
u'bibliography_accession': 17871,
u'bibliography_authors': 18725,
u'bibliography_doi': 11387,
u'bibliography_edition': 2,
u'bibliography_issue': 2376,
u'bibliography_pages': 2800,
u'bibliography_publication': 3065,
u'bibliography_publisher': 18,
u'bibliography_status': 2849,
u'bibliography_title': 19032,
u'bibliography_uri': 990,
u'bibliography_volume': 2552,
u'bibliography_year': 3297,
u'bioassaydatagroup': 47976,
u'bioassaydatagroup__arraydesignprovider': 120359,
u'bioassaydatagroup__bioassaydatacubes': 120359,
u'bioassaydatagroup__bioassays': 120359,
u'bioassaydatagroup__dataformat': 120359,
u'bioassaydatagroup__id': 120359,
u'bioassaydatagroup__isderived': 120359,
u'bioassaydatagroup__name': 120359,
u'bioassaydatagroup_arraydesignprovider': 5862,
u'bioassaydatagroup_bioassaydatacubes': 5862,
u'bioassaydatagroup_bioassays': 5862,
u'bioassaydatagroup_dataformat': 5862,
u'bioassaydatagroup_id': 5862,
u'bioassaydatagroup_isderived': 5862,
u'bioassaydatagroup_name': 5862,
u'description': 47976,
u'description_id': 47976,
u'description_text': 47976,
u'description_text__a': 1788,
u'description_text__a_$': 1785,
u'description_text__a__': 1,
u'description_text__a__blank': 1,
u'description_text__a_bugs.sgul.ac.uk': 1,
u'description_text__a_e-bugs-129': 1,
u'description_text__a_href': 1788,
u'description_text__a_target': 1776,
u'description_text__br': 1400,
u'description_text__i': 48,
u'description_text__i__i': 1,
u'experimentalfactor': 39946,
u'experimentalfactor__name': 64800,
u'experimentalfactor__value': 64800,
u'experimentalfactor_name': 18223,
u'experimentalfactor_value': 18223,
u'experimentdesign': 12373,
u'experimenttype': 47976,
u'files': 47976,
u'files_': 47976,
u'files_biosamples': 784,
u'files_biosamples_png': 784,
u'files_biosamples_png_name': 784,
u'files_biosamples_svg': 784,
u'files_biosamples_svg_name': 784,
u'files_fgem': 40543,
u'files_fgem_available': 40543,
u'files_fgem_name': 40543,
u'files_idf': 47976,
u'files_idf_name': 47976,
u'files_raw': 40650,
u'files_raw_celcount': 40650,
u'files_raw_count': 40650,
u'files_raw_name': 40650,
u'files_sdrf': 47976,
u'files_sdrf_name': 47976,
u'id': 47976,
u'lastupdatedate': 47573,
u'miamescores': 40586,
u'miamescores_derivedbioassaydatascore': 40586,
u'miamescores_factorvaluescore': 40586,
u'miamescores_measuredbioassaydatascore': 40586,
u'miamescores_overallscore': 40586,
u'miamescores_protocolscore': 40586,
u'miamescores_reportersequencescore': 40586,
u'minseqescores': 8072,
u'minseqescores_derivedbioassaydatascore': 8072,
u'minseqescores_experimentdesignscore': 8072,
u'minseqescores_factorvaluescore': 8072,
u'minseqescores_measuredbioassaydatascore': 8072,
u'minseqescores_overallscore': 8072,
u'minseqescores_protocolscore': 8072,
u'name': 47976,
u'organism': 47974,
u'processeddatafiles': 47976,
u'processeddatafiles_available': 47976,
u'protocol': 47887,
u'protocol__accession': 349975,
u'protocol__id': 349975,
u'protocol_accession': 128,
u'protocol_id': 128,
u'provider': 47970,
u'provider__contact': 188399,
u'provider__email': 188399,
u'provider__role': 188399,
u'provider_contact': 13537,
u'provider_email': 13537,
u'provider_role': 13537,
u'rawdatafiles': 47976,
u'rawdatafiles_available': 47976,
u'relatedexperiment': 204,
u'releasedate': 47976,
u'sampleattribute': 47975,
u'sampleattribute__category': 201628,
u'sampleattribute__value': 201628,
u'sampleattribute_category': 3593,
u'sampleattribute_value': 3593,
u'samples': 47976,
u'secondaryaccession': 40269,
u'seqdatauri': 7427,
u'species': 47974,
u'submissiondate': 8671}
And a random Experiment response example:
{u'accession': u'E-MTAB-3684',
u'arraydesign': {u'accession': u'A-AGIL-28',
u'count': 12,
u'id': 11853,
u'legacy_id': 1316855037,
u'name': u'Agilent Whole Human Genome Microarray 4x44K 014850 G4112F (85 cols x 532 rows)'},
u'assays': 12,
u'bioassaydatagroup': {u'arraydesignprovider': None,
u'bioassaydatacubes': 12,
u'bioassays': 12,
u'dataformat': u'rawData',
u'id': None,
u'isderived': 0,
u'name': u'rawData'},
u'description': {u'id': None,
u'text': u'621-101 cells were treated with rapamycin and rhebsiRna and samples were run in triplicates for each condition.'},
u'experimentalfactor': {u'name': u'compound',
u'value': [u'control siRNA',
u'DMSO',
u'rapamycin',
u'Rheb siRNA']},
u'experimenttype': u'transcription profiling by array',
u'files': {u'': None,
u'idf': {u'name': u'E-MTAB-3684.idf.txt'},
u'raw': {u'celcount': 0,
u'count': True,
u'name': u'E-MTAB-3684.raw.1.zip'},
u'sdrf': {u'name': u'E-MTAB-3684.sdrf.txt'}},
u'id': 524913,
u'lastupdatedate': u'2017-03-16',
u'miamescores': {u'derivedbioassaydatascore': 0,
u'factorvaluescore': 1,
u'measuredbioassaydatascore': 1,
u'overallscore': 3,
u'protocolscore': 0,
u'reportersequencescore': 1},
u'name': u'Gene expression profiling of 621-101 cells with rapamycin treatment and Rheb downregulation',
u'organism': u'Homo sapiens',
u'processeddatafiles': {u'available': False},
u'protocol': [{u'accession': u'P-MTAB-45363', u'id': 1139880},
{u'accession': u'P-MTAB-45364', u'id': 1139879},
{u'accession': u'P-MTAB-45365', u'id': 1139882},
{u'accession': u'P-MTAB-45366', u'id': 1139881},
{u'accession': u'P-MTAB-45361', u'id': 1139883},
{u'accession': u'P-MTAB-45362', u'id': 1139884}],
u'provider': [{u'contact': u'MAGDALENA KARBOWNICZEK',
u'email': u'magdalena.karbowniczek@ttuhsc.edu',
u'role': u'investigator'},
{u'contact': u'Elizabeth Henske',
u'email': None,
u'role': u'investigator'},
{u'contact': u'SASIKANTH MANNE',
u'email': u'sasikanthmanne@gmail.com',
u'role': u'submitter'}],
u'rawdatafiles': {u'available': True},
u'releasedate': u'2018-01-01',
u'sampleattribute': [{u'category': u'cell line', u'value': u'621-101'},
{u'category': u'disease', u'value': u'Angiomyolipoma'},
{u'category': u'organism', u'value': u'Homo sapiens'}],
u'samples': 12,
u'species': u'Homo sapiens',
u'submissiondate': u'2009-04-27'}
There is no institution information other than what we can deduce from the email addresses, but email addresses aren't always present.
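As a rough illustration of how little the email addresses give us, here is a sketch that pulls the domains out of the `provider` entries. The function name is hypothetical. It assumes `provider` can be either a single dict or a list of dicts, which is how I read the `provider_*` versus `provider__*` keys in the counts above.

```python
def provider_email_domains(experiment):
    """Return the sorted, de-duplicated email domains of an experiment's providers."""
    providers = experiment.get("provider") or []
    if isinstance(providers, dict):
        # Assumption: a single provider is returned as a bare dict.
        providers = [providers]
    domains = set()
    for provider in providers:
        email = provider.get("email")
        if email and "@" in email:
            domains.add(email.rsplit("@", 1)[1].lower())
    return sorted(domains)


experiment = {
    "provider": [
        {"contact": "MAGDALENA KARBOWNICZEK",
         "email": "magdalena.karbowniczek@ttuhsc.edu",
         "role": "investigator"},
        {"contact": "Elizabeth Henske", "email": None, "role": "investigator"},
        {"contact": "SASIKANTH MANNE",
         "email": "sasikanthmanne@gmail.com",
         "role": "submitter"},
    ]
}
domains = provider_email_domains(experiment)
print(domains)
```

For this experiment, one provider has no email and another uses a personal address, so only one of the three entries points at an institution.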
If I look at the "Investigation description" file for E-MTAB-3684 from the web UI, the "Submitter Institution" piece of information I would be interested in is the "Person Affiliation" field. Here it's Texas Tech University Health Sciences Center. Is it possible to retrieve that information programmatically? Depending on how the filtering is performed (autocomplete?), I think filtering on this may be possible (#139).
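For reference, a minimal sketch of reading that field, assuming the IDF file follows the MAGE-TAB layout of one tab-delimited row per field, with the field name in the first column and its values in the remaining columns. `parse_idf` is a hypothetical helper, and `IDF_EXCERPT` is a hand-written two-row excerpt built from values quoted in this thread, not the real file contents.

```python
import csv
import io


def parse_idf(text):
    """Map each IDF field name to the list of non-empty values on its row."""
    fields = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        key = row[0].strip()
        fields[key] = [value.strip() for value in row[1:] if value.strip()]
    return fields


# Hand-written excerpt for illustration only (not the actual E-MTAB-3684 IDF).
IDF_EXCERPT = (
    "Investigation Title\tGene expression profiling of 621-101 cells "
    "with rapamycin treatment and Rheb downregulation\n"
    "Person Affiliation\tTexas Tech University Health Sciences Center\n"
)

fields = parse_idf(IDF_EXCERPT)
affiliations = fields.get("Person Affiliation", [])
print(affiliations)
```

Returning a list per field matters because an IDF row can carry one value per person, so "Person Affiliation" may hold several institutions.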
Oh wow, of course there is. Yeah, that's no problem if every experiment has one of those files, which I think they generally do.
Also if those files have the same contents. @Miserlou, would it be possible to summarize the completeness of fields in those files like you did with ArrayExpress?
@Miserlou information-only question: are these key-value pairs on an experiment accession basis? Is it then correct that there are 47976 experiments in ArrayExpress?
I just modified Casey's old script, which had a set of JSON files per year, so maybe some of 2016/7/8(?) are missing. The AE homepage says they have 70771 experiments, not 47976, but maybe there are 22000 in the database which don't have any accessions with them. Either way, it's not really meant to be exhaustive, but to show that it's a very steep curve and probably doable for some common-ish fields.
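For anyone wanting to reproduce counts of that shape, here is a sketch of the counting step. It is not the actual script. The naming convention (a single underscore for keys nested in a dict, a double underscore for keys of dicts inside a list, counted once per list item) is inferred from the dump above. The sample record is a trimmed copy of the E-MTAB-3684 response.

```python
from collections import Counter


def count_keys(node, prefix="", counts=None):
    """Count every key in a nested experiment record, flattening nested names."""
    if counts is None:
        counts = Counter()
    for key, value in node.items():
        name = prefix + key
        counts[name] += 1
        if isinstance(value, dict):
            # Keys nested in a dict are joined with a single underscore.
            count_keys(value, name + "_", counts)
        elif isinstance(value, list):
            # Keys of dicts inside a list use a double underscore, once per item.
            for item in value:
                if isinstance(item, dict):
                    count_keys(item, name + "__", counts)
    return counts


experiments = [
    {
        "accession": "E-MTAB-3684",
        "arraydesign": {"accession": "A-AGIL-28"},
        "protocol": [
            {"accession": "P-MTAB-45363"},
            {"accession": "P-MTAB-45364"},
        ],
    },
]

totals = Counter()
for experiment in experiments:
    count_keys(experiment, counts=totals)
print(dict(totals))
```

Under that convention a list-valued key such as `protocol__accession` can exceed the number of experiments, which would explain totals like 349975 in the dump.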
Got it, thanks! I think if we snag the information in the .idf.txt files as well (which may not be the best way to go about it/the only place that information is available), we'll increase the number of fields 👍
I believe this is related to #95
- Wet lab protocols
For this one, I think the better way to think about this is "submitter-supplied protocols." In keeping with our E-MTAB-3684 example, it would be the information here: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3684/protocols/
I would not necessarily display that upfront if we've processed the data (rather than "NO-OP"), but it should be easily accessible to the user.
If we've processed the data (is_ccdl, if I'm not mistaken?), I would imagine the most important information to report upfront is what we've done to it.
- List of samples
I'll note that I think the number of samples should probably be reported in the search results.
[For myself] Example with publication: https://www.ebi.ac.uk/arrayexpress/files/E-GEOD-26114/E-GEOD-26114.idf.txt
Has fields:
Publication Title
Publication Author List
PubMed ID
Publication DOI
@dvenprasad let me know if I'm blocking you on this issue or #139! Those fields seem pretty reasonable to me at this stage. We can see if folks ask for anything else.
These are all the IDF file key values. This took a long time since it was a total scrape, not using prescraped data.
https://gist.github.com/Miserlou/daaf0922cb26989b9049b79615e537d5
@dvenprasad can you take a look at this and give us an update about what would be required to close this/when that might happen?
This has all been taken care of by the files attached to this ticket and the work in https://github.com/AlexsLemonade/refinebio/issues/165.
To keep the prototype I'm building consistent with the data available/extractable in the back end, here are the experiment fields (unordered):
@Miserlou @kurtwheeler @jaclyn-taroni:
@jaclyn-taroni: Are there any other fields which would be useful to include?