Closed jaclyn-taroni closed 6 years ago
Tagging @Miserlou. We'll probably need a new field on the /samples endpoint with the Submitter Supplied Protocol.
@Ariel, just want to clarify: for sample GSM399040, on the modal's "Submitter Supplied Protocol", do you want to show all the information listed on https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM399040 or only some fields?
The fields that I think are important there are Extraction protocol, Label protocol, Hybridization protocol, Scan protocol and Data processing.
@jaclyn-taroni Do you happen to know whether the backend already retrieves and saves the information on this page: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM399040 in the database? I checked the code but didn't find it.
Seems likely to be a @Miserlou question to me. I think he wrote that bit.
Also I can say, from looking at the metadata I got as part of the jackiecrunch, it appears to have protocol information in the sample annotations?
Since the "Submitter Supplied Protocol" information depends entirely on the NCBI webpage, can we simply provide that URL on our web UI? That way we don't need to worry about keeping our database synchronized with the NCBI website.
Question for @dvenprasad ☝️
Will also note that we'll need to do this for ArrayExpress samples as well, but ArrayExpress displays this information at the experiment level on their UI.
Here's an example: https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-31/protocols/
I checked the backend code and found that protocol information is saved in the database only at experiment level: https://github.com/AlexsLemonade/refinebio/blob/89db2284fcd7a853f185db4c2c6a4db15fcf8c03/common/data_refinery_common/models/models.py#L256 Is that true, @Miserlou?
Yes, unless there is more in the source's sample-specific metadata, since we capture that too.
I think 'Submitter Supplied Protocol' should be displayed in the modal. It is one of the fields that users look at while making decisions about choosing samples. Linking them to GEO would impede that process.
@dvenprasad: What if NCBI updated the protocol information but our database hasn't synced with the update? Wouldn't that mislead our users?
I don't think that's a common occurrence. @jaclyn-taroni?
I agree. These accessions do get updated, but I suspect it is usually the sample labels, metadata, etc. rather than the protocols. If a particular dataset’s protocol is that important, I would imagine that the user may contact the submitter.
I wonder whether it makes sense to grab the protocol information from the front end dynamically (instead of saving it in the backend database). That way the protocol information we show on our web UI will always be up-to-date.
I don't think we want to grab protocol information dynamically because then our front end will also have to be able to interact with multiple potential providers (arrayexpress, ena, SRA, GEO, & potentially more if we ingest more sources).
Do all samples have sample-level protocol information, or do some only have experiment-level protocol info?
I believe ArrayExpress experiments have experiment-level protocol information, whereas GEO samples have sample-level protocol info (based on the metadata I'm looking at). I can't speak to SRA, though.
@cgreene: Do SRA samples have experiment or sample-level protocol information? If they do, can you give me one example (like a URL)?
@jaclyn-taroni: Since ArrayExpress has experiment-level protocol information, does that mean that on our web UI, each sample will show the protocol information of the experiment that it belongs to? Is it possible that an ArrayExpress sample is associated with multiple experiments?
This SRA sample: https://www.ncbi.nlm.nih.gov/sra/SRX3713332[accn]
Has this BioSample accession: https://www.ncbi.nlm.nih.gov/biosample/SAMN08555006
Which links to this GEO record: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3014892
That appears to have some protocol information at the sample level. I'm not sure how deep our surveyor/downloader digs, though, and what is available to us.
Have been chatting with @dongbohu -- our SRA surveyor is using EBI.
If we take a look at the sample XML data for the SRA sample @cgreene posted above: https://www.ebi.ac.uk/ena/data/view/SRS2971726&display=xml
We can see that there is no protocol information, but if we look at the experiment (e.g., SRX) XML: https://www.ebi.ac.uk/ena/data/view/SRX3713332&display=xml
it contains the library construction protocol, which is what is displayed here.
That suggests that the protocol information for SRA is at the experiment level, rather than the sample level.
We don't get any "sample-level" information from SRA other than what we can infer and extract, related: https://github.com/AlexsLemonade/refinebio-frontend/issues/274
@arielsvn:
Here is the structure of the protocol information that the sample API will offer:
Field name: protocol_info
For GEO samples:
{
'Extraction protocol': [ string, ... ],
'Label protocol': [ string, ... ],
'Hybridization protocol': [ string, ... ],
'Scan protocol': [ string, ... ],
'Data processing': [ string, ... ],
'Reference': string_of_URL
}
Each value is an array of strings (but most arrays include only one string). The values of some of the keys may be blank. If that happens, you can either set NA or ignore them on the web UI.
For ArrayExpress samples:
[
{
Accession: string,
Text: string,
Type: string,
Reference: string_of_URL
},
{
Accession: string,
Title: string,
Type: string,
Description: string,
Reference: string_of_URL
},
...
]
For SRA samples:
[
{
Description: string,
Reference: string_of_URL
},
{
Description: string,
Reference: string_of_URL
},
...
]
But most likely this array will include one and only one element.
Please let me know if you have any questions.
@arielsvn: In case the protocol_info you get from a sample is an empty object or array, you can simply put NA there. You can ask @dvenprasad for confirmation.
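To make the NA handling concrete, here is a minimal Python sketch of how a client could turn the keyed (Extraction protocol, Label protocol, ...) form of protocol_info into display rows. The function name `geo_protocol_rows` and the fallback label are hypothetical; only the key names come from the structure above, and the real front end may handle this differently.

```python
# Keys of the keyed protocol_info form, in display order (from the structure above).
GEO_KEYS = [
    "Extraction protocol",
    "Label protocol",
    "Hybridization protocol",
    "Scan protocol",
    "Data processing",
]


def geo_protocol_rows(protocol_info):
    """Return (label, text) pairs for display; blank or missing values become "NA"."""
    if not protocol_info:  # empty object, empty array, or None
        return [("Submitter Supplied Protocol", "NA")]  # hypothetical fallback label
    rows = []
    for key in GEO_KEYS:
        values = protocol_info.get(key) or []
        # Each value is an array of strings; drop empty strings before joining.
        parts = [s.strip() for s in values if s and s.strip()]
        rows.append((key, " ".join(parts) if parts else "NA"))
    rows.append(("Reference", protocol_info.get("Reference") or "NA"))
    return rows
```

Calling `geo_protocol_rows({})` yields a single NA row, so the modal always has something to show.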
Thanks @dongbohu for the detailed explanation! I do have some questions:
Is it possible to add another field that specifies the type of sample? That would avoid having to test the structure of protocol_info in order to determine the sample type.
Since the pipelines field is obsolete, how are we going to determine which samples are Submitter Processed? Before, pipelines == ['Submitter-processed'] is what identified these samples. For submitter processed samples we should display something similar to this design.
@dvenprasad The submitter processed protocol information will be different depending on the type of sample, as @dongbohu mentioned above. How should the dialog look for each case?
@jaclyn-taroni @dongbohu For ArrayExpress, are the accession fields sample accessions?
EDIT: For SRA, does each of the strings in the string array map to a sample, or is it independent?
Would it be possible for me to get some sample data from this API endpoint so I can design a reasonable way to represent it on the UI?
@dvenprasad Here is an example of ArrayExpress sample's protocol info: https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-31/protocols/
For SRA, the protocol info is in the <LIBRARY_CONSTRUCTION_PROTOCOL> field. See this example:
https://www.ebi.ac.uk/ena/data/view/SRX3713332&display=xml
If an SRA sample is associated with multiple experiments, this LIBRARY_CONSTRUCTION_PROTOCOL could be different (but that is very unlikely). That is why I set its type to an array of strings, but in most (if not all) cases, it should have only one string.
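As a rough illustration of why an array of strings fits, the sketch below collects every distinct LIBRARY_CONSTRUCTION_PROTOCOL value from an experiment XML document using Python's standard library. The XML snippet is a trimmed, hypothetical stand-in (placeholder accession and text), and the function name is made up; this is not necessarily how the surveyor does it.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for an ENA experiment XML document; only the
# element name LIBRARY_CONSTRUCTION_PROTOCOL is taken from the thread.
SAMPLE_XML = """
<ROOT>
  <EXPERIMENT accession="SRX0000001">
    <DESIGN>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_CONSTRUCTION_PROTOCOL>Placeholder protocol text.</LIBRARY_CONSTRUCTION_PROTOCOL>
      </LIBRARY_DESCRIPTOR>
    </DESIGN>
  </EXPERIMENT>
</ROOT>
""".strip()


def library_construction_protocols(xml_text):
    """Collect every non-empty LIBRARY_CONSTRUCTION_PROTOCOL string, without duplicates."""
    root = ET.fromstring(xml_text)
    protocols = []
    # iter() walks the whole tree, so the nesting depth of the element does not matter.
    for node in root.iter("LIBRARY_CONSTRUCTION_PROTOCOL"):
        text = (node.text or "").strip()
        if text and text not in protocols:
            protocols.append(text)
    return protocols
```

With a single experiment the result is a one-element list, which matches the expectation that most samples will have only one string.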
Thanks @dongbohu.
Follow-up question (most likely for @jaclyn-taroni) about the accession codes on ArrayExpress. They look like they are accession IDs for protocols. Is that correct?
If so, do these values mean anything outside of ArrayExpress? i.e., does it make sense to display them on the UI?
If so, do these values mean anything outside of ArrayExpress?
Not to my knowledge. My suspicion is that the accession code is not that useful AND there are over half a million unique protocols apparently: https://www.ebi.ac.uk/arrayexpress/protocols/browse.html so, to me, that lends some weight to my gut feeling.
@jaclyn-taroni and @dvenprasad: I just found out that EBI provides the following detailed protocol information for each ArrayExpress experiment: https://www.ebi.ac.uk/arrayexpress/json/v3/experiments/E-MEXP-31/protocols
I reformatted the JSON content provided by this URL to make it more readable:
{
"protocols": {
"api-version": 3,
"api-revision": "091015",
"version": 1.0,
"revision": "091015",
"total-protocols": 6,
"protocol": [
{
"id": 63,
"accession": "Affymetrix:Protocol:ExpressionStat",
"name": "Affymetrix:Protocol:ExpressionStat",
"text": "Title: Affymetrix CHP Analysis (ExpressionStat). Description:",
"type": "bioassay_data_transformation",
"performer": null,
"hardware": null,
"software": "MicroArraySuite 5.0",
"standardpublicprotocol": 1,
"parameter": [
"Algorithm name",
"Algorithm version",
"Alpha1",
"Alpha2",
"Baseline file",
"Gamma1H",
"Gamma1L",
"Gamma2H",
"Gamma2L",
"Mask file",
"Normalization factor",
"Perturbation",
"Scale factor",
"Tau"
]
},
{
"id": 79337,
"accession": "Affymetrix:Protocol:Hybridization-EukGE-WS2v4[]",
"name": "Affymetrix:Protocol:Hybridization-EukGE-WS2v4[]",
"text": "Title: Affymetrix EukGE-WS2v4 Hybridization. Description:",
"type": "hybridization",
"performer": null,
"hardware": null,
"software": "MicroArraySuite 5.0",
"standardpublicprotocol": 1
},
{
"id": 51,
"accession": "P-AFFY-6",
"name": "P-AFFY-6",
"text": "Title: Affymetrix CEL analysis. Description:",
"type": "feature_extraction",
"performer": null,
"hardware": "418 [Affymetrix]",
"software": "MicroArraySuite 5.0",
"standardpublicprotocol": 1
},
{
"id": 230174,
"accession": "P-MEXP-1358",
"name": "P-MEXP-1358",
"text": "Pachytene spermatocytes and early spermatids were prepared by centrifugal elutriation to a purity greater than 90% from 8 rats at 90 dpp as previously described except that cells were mechanically dispersed (28). Purified cells were centrifuged, snap frozen in liquid nitrogen and stored at - 80°C. Total testicular samples were produced by excising and snap freezing testes from three Sprague-Dawley rats at 90 dpp in liquid nitrogen. The outermost connective tissue capsule was then surgically removed on the frozen organs before they were manually ground using a ceramic mortar and pistil. Total RNA was purified using the RNeasy kit (Qiagen) following the manufacturer’s instructions. The RNA was immediately snap frozen in liquid nitrogen and stored at –80°C. Samples from brain (Lewis, 60 dpp) and skeletal muscle (Wistar, 70 dpp) were isolated from adult rats according to standard procedures. ",
"type": "pool",
"performer": null,
"hardware": null,
"software": null,
"standardpublicprotocol": null
},
{
"id": 230175,
"accession": "P-MEXP-1359",
"name": "P-MEXP-1359",
"text": "Total RNA was purified using the RNeasy kit (Qiagen) following the manufacturer’s instructions. The RNA was immediately snap frozen in liquid nitrogen and stored at –80°C. ",
"type": "nucleic_acid_extraction",
"performer": null,
"hardware": null,
"software": null,
"standardpublicprotocol": null,
"parameter": [
"Amplification",
"Extracted product"
]
},
{
"id": 230173,
"accession": "P-MEXP-1360",
"name": "P-MEXP-1360",
"text": "Total RNA was prepared using RNeasy Mini-Spin columns (Qiagen) using standard protocols. RNA quality was monitored with RNA Nano 6000 Chips and the 2100 Bioanalyzer (Agilent). Labeling of total RNA was performed as described in the Expression Analysis Technical Manual (Affymetrix) with minor modifications as indicated below. Double-stranded (ds) cDNA was synthesized from 13 µg of total RNA using the Superscript II kit (Invitrogen Life Technologies) and a T7-(dT)24-VN primer 5'GGCCAGTGAATTGTAATACGACTCACTATAGGGAGGCGG-(T)24-VN3' [V = G, A, or C, N = G, A, C or T]. The in vitro transcription (IVT) reaction was carried out with 50% of the ds cDNA synthesized with the Bioarray HighYield RNA Transcript Labeling Kit (Enzo). Subsequently, the biotin-labeled cRNAs were purified by using RNeasy Mini spin columns and analysed on RNA Nano 6000 Chips. The cRNA target was then incubated at 94°C for 35 minutes; the resulting fragments of 50-150 nucleotides were monitored using the Bionalyzer.",
"type": "labeling",
"performer": null,
"hardware": null,
"software": null,
"standardpublicprotocol": null,
"parameter": [
"Amount of nucleic acid labeled",
"Amplification",
"Label used"
]
}
]
}
}
As you can see, it includes much more information than: https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-31/protocols/
Two questions:
1. Should we record these details in the protocol_description field in the database's Experiment model? Right now this field is only a string that is a combination of the text fields in all protocols.
2. On the Web UI's Submitter Provided protocol, should we display more fields than the ones that I proposed earlier? (https://github.com/AlexsLemonade/refinebio-frontend/issues/225#issuecomment-417345139)
Should we record these details in protocol_description field in the database's Experiment model? Right now this field is only a string that is a combination of text fields in all protocols.
@dongbohu what, if anything, is the downside to storing these?
On the Web UI's Submitter Provided protocol, should we display more fields than the one that I proposed earlier?
No, I don't think so. The bits of information that would be "missing" in this case, such as the algorithm used for data processing (e.g., MicroArray Suite 5.0), are the parts that are most likely to be irrelevant because Affymetrix samples typically have raw data associated with them. I would say if someone is very interested in this additional information, it would behoove them to do a deep dive at the source anyway. I think the cost of figuring out how to display the additional information gracefully >> the value of doing so. If users feel differently AND we've stored the details, it seems to me that we can go from there.
@jaclyn-taroni: The only downside of storing these details is probably that it makes the database slightly larger. But if these details are useful, then we should keep track of them.
I will go ahead and modify the backend to save them but keep the original fields for the web UI.
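A small Python sketch of that split, assuming the parsed ArrayExpress payload has the shape shown in the JSON above: keep the full protocol records for storage and derive the smaller records for the web UI. The function name and the `Reference` argument are hypothetical, and the UI keys follow the Accession/Text/Type/Reference form proposed earlier; the actual backend code may differ.

```python
def split_arrayexpress_protocols(payload, reference_url):
    """Return (full_records, ui_records) from a parsed ArrayExpress protocols payload."""
    # Full records: everything the provider returned, kept as-is for the database.
    full_records = payload.get("protocols", {}).get("protocol", [])
    # UI records: only the fields proposed for the modal, plus the source URL.
    ui_records = [
        {
            "Accession": p.get("accession"),
            "Text": p.get("text"),
            "Type": p.get("type"),
            "Reference": reference_url,
        }
        for p in full_records
    ]
    return full_records, ui_records
```

Because the full records are stored untouched, fields such as `software` or `parameter` remain available if users later ask for them.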
By the way, @jaclyn-taroni and @dvenprasad: Do you think it would be helpful to add an extra Reference field to the protocol_info structure? We can set this field to the URL where we retrieved this protocol information. With this field on the web UI, users will be able to find more details if they want.
I like that idea @dongbohu provided it is not too big a time investment at the moment
@jaclyn-taroni 👍 The URLs are already saved either in experiment or sample models, so it is just a little bit of extra work.
Great, thanks @dongbohu
@arielsvn: In case you didn't see, after some discussions with @jaclyn-taroni (see our comments above), I tweaked the sample's protocol_info structure a little bit. The main difference is the addition of the Reference key.
@arielsvn: I found some errors in the protocol_info for GEO samples above and I have updated it. The main difference is that most values are arrays of strings (instead of a single string).
Sounds good @dongbohu. I have a question: is it possible to add another field that specifies the type of sample (GEO, ArrayExpress or SRA)? That way I'll know the structure of protocol_info.
@arielsvn: Doesn't source_database in the Sample endpoint already have that information?
You're right, I had missed that field. Thanks @dongbohu!
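For illustration, here is a short Python sketch of choosing a display layout from source_database rather than inspecting the structure of protocol_info. The literal values "GEO", "ARRAY_EXPRESS" and "SRA", the layout names, and the pairing of each source with a shape are all assumptions based on the structures earlier in the thread, so check them against the real API before relying on them.

```python
# Assumed source_database values mapped to hypothetical layout names.
LAYOUTS = {
    "GEO": "keyed_sections",           # object: protocol name -> array of strings
    "ARRAY_EXPRESS": "protocol_list",  # array of protocol objects
    "SRA": "description_list",         # array of description objects
}


def protocol_layout(sample):
    """Pick a layout name from the sample's source_database field."""
    return LAYOUTS.get(sample.get("source_database"), "unknown")
```

Returning "unknown" for unrecognized sources lets the UI fall back to NA instead of failing if another source is ingested later.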
Backend implemented in this PR: https://github.com/AlexsLemonade/refinebio/pull/587
All submitter-processed protocol information appears to be the same, regardless of the experiment in question.
GSM182283 from GSE7529
GSM399040 from E-GEOD-15892
I believe this may be hardcoded?
https://github.com/AlexsLemonade/refinebio-frontend/blob/c15940459fdd39072e5c05089129f4d06990550d/src/containers/Experiment/SamplesTable.js#L349
We'll need to supply the protocol information for the sample that the user has clicked on. The information on this page is what I would expect to see for GSM399040.
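The intended behavior can be sketched in a few lines of Python: look up protocol_info by the accession of the clicked sample instead of using one fixed value for every row. The field name `accession_code` and the function name are assumptions for illustration; the real fix belongs in the front-end component linked above.

```python
def protocol_for_sample(samples, accession_code):
    """Return the protocol_info of the clicked sample, or None if it has none."""
    for sample in samples:
        if sample.get("accession_code") == accession_code:
            # Empty objects or arrays are treated as "no protocol information".
            return sample.get("protocol_info") or None
    return None
```

Two samples from different experiments should then show different protocol text, and a sample without any protocol information can be rendered as NA.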