Submitter-processed protocol information appears to be hardcoded

jaclyn-taroni commented 6 years ago

All submitter-processed protocol information appears to be the same, regardless of the experiment in question.

GSM182283 from GSE7529

screen shot 2018-08-07 at 2 35 31 pm

GSM399040 from E-GEOD-15892

screen shot 2018-08-07 at 2 37 47 pm

I believe this may be hardcoded?

https://github.com/AlexsLemonade/refinebio-frontend/blob/c15940459fdd39072e5c05089129f4d06990550d/src/containers/Experiment/SamplesTable.js#L349

We'll need to supply the protocol information for the sample that the user has clicked on. The information on this page is what I would expect to see for GSM399040.

arielsvn commented 6 years ago

Tagging @Miserlou. We'll probably need a new field on the /samples endpoint with the Submitter Supplied Protocol.

dongbohu commented 6 years ago

@Ariel, just want to clarify, for sample GSM399040, on the modal's "Submitter Supplied Protocol", you want to show all the information listed on https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM399040 or only some fields?

jaclyn-taroni commented 6 years ago

The fields that I think are important there are Extraction protocol, Label protocol, Hybridization protocol, Scan protocol and Data processing.

dongbohu commented 6 years ago

@jaclyn-taroni Do you happen to know whether the backend already retrieves and saves the information on this page: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM399040 in the database? I checked the code but didn't find it.

cgreene commented 6 years ago

Seems likely to be a @Miserlou question to me. I think he wrote that bit.

jaclyn-taroni commented 6 years ago

Also I can say, from looking at the metadata I got as part of the jackiecrunch, it appears to have protocol information in the sample annotations?

dongbohu commented 6 years ago

Since "Submitter Supplied Protocol" information depends entirely on the NCBI webpage, can we simply provide that URL on our web UI? That way we don't need to worry about keeping our database synchronized with the NCBI website.

jaclyn-taroni commented 6 years ago

Question for @dvenprasad ☝️

jaclyn-taroni commented 6 years ago

Will also note that we'll need to do this for ArrayExpress samples as well, but ArrayExpress displays this information at the experiment level on their UI.

Here's an example: https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-31/protocols/

dongbohu commented 6 years ago

I checked the backend code and found that protocol information is saved in the database only at experiment level: https://github.com/AlexsLemonade/refinebio/blob/89db2284fcd7a853f185db4c2c6a4db15fcf8c03/common/data_refinery_common/models/models.py#L256 Is that true, @Miserlou?

Miserlou commented 6 years ago

Yes, unless there is more in source sample-specific metadata as we'll capture that too.

dvenprasad commented 6 years ago

I think 'Submitter Supplied Protocol' should be displayed in the modal. It is one of the fields that users look at while making decisions about choosing samples. Linking them to GEO would impede that process.

dongbohu commented 6 years ago

@dvenprasad: What if NCBI updated the protocol information but our database hasn't synced with the update? Wouldn't that mislead our users?

dvenprasad commented 6 years ago

I don't think that's a common occurrence. @jaclyn-taroni?

jaclyn-taroni commented 6 years ago

I agree. These accessions do get updated, but I suspect it is usually the sample labels, metadata, etc. rather than the protocols. If a particular dataset’s protocol is that important, I would imagine that that user may contact the submitter.

dongbohu commented 6 years ago

I wonder whether it makes sense to grab the protocol information from the front end dynamically (instead of saving it in the backend database). That way the protocol information we show on our web UI will always be up to date.

cgreene commented 6 years ago

I don't think we want to grab protocol information dynamically because then our front end will also have to be able to interact with multiple potential providers (arrayexpress, ena, SRA, GEO, & potentially more if we ingest more sources).

dongbohu commented 6 years ago

Do all samples have sample-level protocol information, or do some only have experiment-level protocol info?

jaclyn-taroni commented 6 years ago

I believe ArrayExpress experiments have experiment-level protocol information, whereas GEO samples have sample-level protocol info (based on the metadata I'm looking at). I can't speak to SRA, though.

dongbohu commented 6 years ago

@cgreene: Do SRA samples have experiment or sample-level protocol information? If they do, can you give me one example (like a URL)?

@jaclyn-taroni: Since ArrayExpress has experiment-level protocol information, does that mean that on our web UI, each sample will show the protocol information of the experiment that it belongs to? Is it possible that an ArrayExpress sample is associated with multiple experiments?

cgreene commented 6 years ago

This SRA sample: https://www.ncbi.nlm.nih.gov/sra/SRX3713332[accn]

Has this BioSample accession: https://www.ncbi.nlm.nih.gov/biosample/SAMN08555006

Which links to this GEO record: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3014892

That appears to have some protocol information at the sample level. I'm not sure how deep our surveyor/downloader digs, though, or what is available to us.

jaclyn-taroni commented 6 years ago

Have been chatting with @dongbohu -- our SRA surveyor is using EBI.

If we take a look at the sample XML data for the SRA sample @cgreene posted above: https://www.ebi.ac.uk/ena/data/view/SRS2971726&display=xml

We can see that there is no protocol information, but if we look at the experiment (e.g., SRX) XML: https://www.ebi.ac.uk/ena/data/view/SRX3713332&display=xml

it contains the library construction protocol, which is what is displayed here.

That suggests that the protocol information for SRA is at the experiment level, rather than the sample level.

Miserlou commented 6 years ago

We don't get any "sample-level" information from SRA other than what we can infer and extract, related: https://github.com/AlexsLemonade/refinebio-frontend/issues/274

dongbohu commented 6 years ago

@arielsvn: Here is the structure of the protocol information that the sample API will offer.

Field name: protocol_info

For GEO samples, this field is a dictionary with multiple key/value pairs:

{
  'Extraction protocol': [ string, ... ], 
  'Label protocol': [ string, ... ],
  'Hybridization protocol': [ string, ... ],
  'Scan protocol': [ string, ... ],
  'Data processing': [ string, ... ],
  'Reference': string_of_URL
}

Each value is an array of strings (but most arrays include only one string). The values of some of the keys may be blank. If that happens, you can either set NA or ignore them on the web UI.
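As a rough illustration of the blank-value handling described above, here is a minimal Python sketch of how a client might turn a GEO `protocol_info` dictionary into display rows. The function name `geo_protocol_rows`, the `NA` constant, and the example text are hypothetical and not part of the API; only the key names and the GSM399040 URL come from this thread.

```python
NA = "NA"

# Keys from the GEO structure above; "Reference" is handled separately.
GEO_PROTOCOL_KEYS = [
    "Extraction protocol",
    "Label protocol",
    "Hybridization protocol",
    "Scan protocol",
    "Data processing",
]


def geo_protocol_rows(protocol_info):
    """Return (label, text) pairs for display, using NA for blank values."""
    rows = []
    for key in GEO_PROTOCOL_KEYS:
        # Each value is expected to be a list of strings; a missing key or
        # an empty list is treated the same way.
        values = protocol_info.get(key) or []
        parts = [s.strip() for s in values if s and s.strip()]
        rows.append((key, " ".join(parts) if parts else NA))
    rows.append(("Reference", protocol_info.get("Reference") or NA))
    return rows


# Made-up example values (not real GEO text), for illustration only.
example = {
    "Extraction protocol": ["Total RNA was extracted with a column kit."],
    "Label protocol": [],
    "Scan protocol": ["", "Arrays were scanned."],
    "Reference": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM399040",
}
rows = geo_protocol_rows(example)
for label, text in rows:
    print(f"{label}: {text}")
```

Whether blank keys are shown as NA or hidden entirely is a UI decision; the sketch shows the NA variant.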

For ArrayExpress samples, the protocol info will be an array of dictionaries:

[
  {
      Accession: string,
      Text: string,
      Type: string,
      Reference: string_of_URL
  },
  {
      Accession: string,
      Title: string,
      Type: string,
      Description: string,
      Reference: string_of_URL
  },
  ...
]

For SRA samples, the protocol info will be an array of dictionaries:

[ 
  { 
      Description: string,
      Reference: string_of_URL
  },
  { 
      Description: string,
      Reference: string_of_URL
  },
  ... 
]

But most likely this array will include one and only one element.

Please let me know if you have any questions.

dongbohu commented 6 years ago

@arielsvn: In case the protocol_info you get for a sample is an empty object or array, you can simply put NA there. You can ask @dvenprasad for confirmation.

arielsvn commented 6 years ago

Thanks @dongbohu for the detailed explanation! I do have some questions:

1. Is it possible to add another field that specifies the type of sample? That would avoid having to test the structure of protocol_info in order to determine the sample type.
2. Since the pipelines field is obsolete, how are we going to determine which samples are submitter-processed? Previously, pipelines == ['Submitter-processed'] is what identified these samples. For submitter-processed samples we should display something similar to this design.
3. @dvenprasad The submitter-processed protocol information will differ depending on the type of sample, as @dongbohu mentioned above. How should the dialog look in each case?

dvenprasad commented 6 years ago

@jaclyn-taroni @dongbohu For ArrayExpress, are the accession fields samples accessions?

EDIT: For SRA, does each of the strings in the string array map to a sample, or is it independent?

Would it be possible for me to get some sample data from this API endpoint so I can design a reasonable way to represent it on the UI?

dongbohu commented 6 years ago

@dvenprasad Here is an example of ArrayExpress sample's protocol info: https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-31/protocols/

For SRA, the protocol info is in <LIBRARY_CONSTRUCTION_PROTOCOL> field. See this example: https://www.ebi.ac.uk/ena/data/view/SRX3713332&display=xml

If an SRA sample is associated with multiple experiments, this LIBRARY_CONSTRUCTION_PROTOCOL could differ between them (though that is very unlikely). That is why I set its type to an array of strings, but in most (if not all) cases it should have only one string.
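For illustration, a minimal Python sketch of pulling `LIBRARY_CONSTRUCTION_PROTOCOL` text out of experiment XML and shaping it like the SRA `protocol_info` entries proposed earlier. The XML below is a hand-written placeholder (not the real SRX3713332 record), the nesting is assumed, and `sra_protocol_info` is a hypothetical helper name rather than anything in the backend.

```python
import xml.etree.ElementTree as ET

# Hand-written placeholder shaped like an experiment record; the real
# document has many more elements and a real accession.
SAMPLE_XML = """
<EXPERIMENT_SET>
  <EXPERIMENT accession="SRX0000000">
    <DESIGN>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_CONSTRUCTION_PROTOCOL>
          Placeholder protocol text.
        </LIBRARY_CONSTRUCTION_PROTOCOL>
      </LIBRARY_DESCRIPTOR>
    </DESIGN>
  </EXPERIMENT>
</EXPERIMENT_SET>
"""


def sra_protocol_info(xml_text, reference_url):
    """Collect distinct library construction protocols as a list of dicts."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    seen = set()
    # iter() walks the whole tree, so the exact nesting does not matter.
    for node in root.iter("LIBRARY_CONSTRUCTION_PROTOCOL"):
        text = (node.text or "").strip()
        if text and text not in seen:
            seen.add(text)
            entries.append({"Description": text, "Reference": reference_url})
    return entries


REFERENCE = "https://www.ebi.ac.uk/ena/data/view/SRX3713332&display=xml"
protocol_info = sra_protocol_info(SAMPLE_XML, REFERENCE)
print(protocol_info)
```

Deduplicating on the protocol text keeps the array at one element in the common case where every associated experiment reports the same protocol.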

dvenprasad commented 6 years ago

Thanks @dongbohu.

Follow-up question (most likely for @jaclyn-taroni) about the accession codes on ArrayExpress. They look like they are accession IDs for protocols. Is that correct?

If so, do these values mean anything outside of ArrayExpress? I.e., does it make sense to display them on the UI?

jaclyn-taroni commented 6 years ago

> If so, do these values mean anything outside of ArrayExpress?

Not to my knowledge. My suspicion is that the accession code is not that useful AND there are over half a million unique protocols apparently: https://www.ebi.ac.uk/arrayexpress/protocols/browse.html so, to me, that lends some weight to my gut feeling.

dongbohu commented 6 years ago

@jaclyn-taroni and @dvenprasad: I just found out that EBI provides the following detailed protocol information for each ArrayExpress experiment: https://www.ebi.ac.uk/arrayexpress/json/v3/experiments/E-MEXP-31/protocols

I reformatted the JSON content provided by this URL to make it more readable:

{
  "protocols": {
    "api-version": 3,
    "api-revision": "091015",
    "version": 1.0,
    "revision": "091015",
    "total-protocols": 6,
    "protocol": [
      {
    "id": 63,
    "accession": "Affymetrix:Protocol:ExpressionStat",
    "name": "Affymetrix:Protocol:ExpressionStat",
    "text": "Title: Affymetrix CHP Analysis (ExpressionStat). Description:",
    "type": "bioassay_data_transformation",
    "performer": null,
    "hardware": null,
    "software": "MicroArraySuite 5.0",
    "standardpublicprotocol": 1,
    "parameter": [
      "Algorithm name",
      "Algorithm version",
      "Alpha1",
      "Alpha2",
      "Baseline file",
      "Gamma1H",
      "Gamma1L",
      "Gamma2H",
      "Gamma2L",
      "Mask file",
      "Normalization factor",
      "Perturbation",
      "Scale factor",
      "Tau"
    ]
      },

      {
    "id": 79337,
    "accession": "Affymetrix:Protocol:Hybridization-EukGE-WS2v4[]",
    "name": "Affymetrix:Protocol:Hybridization-EukGE-WS2v4[]",
    "text": "Title: Affymetrix EukGE-WS2v4 Hybridization. Description:",
    "type": "hybridization",
    "performer": null,
    "hardware": null,
    "software": "MicroArraySuite 5.0",
    "standardpublicprotocol": 1
      },

      {
    "id": 51,
    "accession": "P-AFFY-6",
    "name": "P-AFFY-6",
    "text": "Title: Affymetrix CEL analysis. Description:",
    "type": "feature_extraction",
    "performer": null,
    "hardware": "418 [Affymetrix]",
    "software": "MicroArraySuite 5.0",
    "standardpublicprotocol": 1
      },

      {
    "id": 230174,
    "accession": "P-MEXP-1358",
    "name": "P-MEXP-1358",
    "text": "Pachytene spermatocytes and early spermatids were prepared by centrifugal elutriation to a purity greater than 90% from 8 rats at 90 dpp as previously described except that cells were mechanically dispersed (28).  Purified cells were centrifuged, snap frozen in liquid nitrogen and stored at - 80°C.  Total testicular samples were produced by excising and snap freezing testes from three Sprague-Dawley rats at 90 dpp in liquid nitrogen.  The outermost connective tissue capsule was then surgically removed on the frozen organs before they were manually ground using a ceramic mortar and pistil.  Total RNA was purified using the RNeasy kit (Qiagen) following the manufacturer’s instructions.  The RNA was immediately snap frozen in liquid nitrogen and stored at –80°C.   Samples from brain (Lewis, 60 dpp) and skeletal muscle (Wistar, 70 dpp) were isolated from adult rats according to standard procedures. ",
    "type": "pool",
    "performer": null,
    "hardware": null,
    "software": null,
    "standardpublicprotocol": null
      },

      {
    "id": 230175,
    "accession": "P-MEXP-1359",
    "name": "P-MEXP-1359",
    "text": "Total RNA was purified using the RNeasy kit (Qiagen) following the manufacturer’s instructions.  The RNA was immediately snap frozen in liquid nitrogen and stored at –80°C. ",
    "type": "nucleic_acid_extraction",
    "performer": null,
    "hardware": null,
    "software": null,
    "standardpublicprotocol": null,
    "parameter": [
      "Amplification",
      "Extracted product"
    ]
      },

      {
    "id": 230173,
    "accession": "P-MEXP-1360",
    "name": "P-MEXP-1360",
    "text": "Total RNA was prepared using RNeasy Mini-Spin columns (Qiagen) using standard protocols.  RNA quality was monitored with  RNA Nano 6000 Chips and the 2100 Bioanalyzer (Agilent).  Labeling of total RNA was performed as described in the Expression Analysis Technical Manual (Affymetrix) with minor modifications as indicated below.  Double-stranded (ds) cDNA was synthesized from 13 µg of total RNA using the Superscript II kit (Invitrogen Life Technologies) and a T7-(dT)24-VN primer 5'GGCCAGTGAATTGTAATACGACTCACTATAGGGAGGCGG-(T)24-VN3' [V = G, A, or C, N = G, A, C or T].  The in vitro transcription (IVT) reaction was carried out with 50% of the ds cDNA synthesized with the Bioarray HighYield RNA Transcript Labeling Kit (Enzo).  Subsequently, the biotin-labeled cRNAs were purified by using RNeasy Mini spin columns and analysed on RNA Nano 6000 Chips. The cRNA target was then incubated at 94°C for 35 minutes; the resulting fragments of 50-150 nucleotides were monitored using the Bionalyzer.",
    "type": "labeling",
    "performer": null,
    "hardware": null,
    "software": null,
    "standardpublicprotocol": null,
    "parameter": [
      "Amount of nucleic acid labeled",
      "Amplification",
      "Label used"
    ]
      }
    ]
  }
}

As you can see, it includes much more information than: https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-31/protocols/

Two questions:

1. Should we record these details in the protocol_description field in the database's Experiment model? Right now this field is only a string that is a combination of the text fields in all protocols.
2. On the Web UI's Submitter Provided protocol, should we display more fields than the ones that I proposed earlier? (https://github.com/AlexsLemonade/refinebio-frontend/issues/225#issuecomment-417345139)
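To make the two options concrete, here is a minimal Python sketch that maps JSON shaped like the response above onto the Accession/Text/Type/Reference entries proposed earlier, and also builds a single combined string from the text fields. The helper names are hypothetical, the payload is trimmed to two of the six protocols with only the fields used here, and joining with newlines is an assumption about how the text fields might be combined, not a description of the current backend.

```python
# Trimmed copy of two protocol entries from the JSON above.
PAYLOAD = {
    "protocols": {
        "total-protocols": 2,
        "protocol": [
            {
                "accession": "Affymetrix:Protocol:ExpressionStat",
                "text": "Title: Affymetrix CHP Analysis (ExpressionStat). Description:",
                "type": "bioassay_data_transformation",
            },
            {
                "accession": "P-AFFY-6",
                "text": "Title: Affymetrix CEL analysis. Description:",
                "type": "feature_extraction",
            },
        ],
    }
}

REFERENCE = "https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-31/protocols/"


def arrayexpress_protocol_info(payload, reference_url):
    """Keep only the fields proposed for the web UI, plus the source URL."""
    entries = payload.get("protocols", {}).get("protocol", [])
    return [
        {
            "Accession": p.get("accession"),
            "Text": p.get("text"),
            "Type": p.get("type"),
            "Reference": reference_url,
        }
        for p in entries
    ]


def combined_description(payload):
    """Join every non-empty text field into one string (assumed format)."""
    entries = payload.get("protocols", {}).get("protocol", [])
    return "\n".join(p["text"] for p in entries if p.get("text"))


info = arrayexpress_protocol_info(PAYLOAD, REFERENCE)
description = combined_description(PAYLOAD)
print(info[1]["Accession"], info[1]["Type"])
print(description)
```

Storing the full per-protocol dictionaries while exposing only the trimmed fields would keep both options open.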

jaclyn-taroni commented 6 years ago

> Should we record these details in protocol_description field in the database's Experiment model? Right now this field is only a string that is a combination of text fields in all protocols.

@dongbohu what, if anything, is the downside to storing these?

> On the Web UI's Submitter Provided protocol, should we display more fields than the one that I proposed earlier?

No, I don't think so. The bits of information that would be "missing" in this case, such as the algorithm used for data processing (e.g., MicroArray Suite 5.0), are the parts that are most likely to be irrelevant, because Affymetrix samples typically have raw data associated with them. I would say if someone is very interested in this additional information, it would behoove them to do a deep dive at the source anyway. I think the cost of figuring out how to display the additional information gracefully >> the value of doing so. If users feel differently AND we've stored the details, it seems to me that we can go from there.

dongbohu commented 6 years ago

@jaclyn-taroni: The only downside of storing these details is probably that it makes the database slightly larger. But if these details are useful, then we should keep track of them.

I will go ahead and modify the backend to save them, but keep the original fields for the web UI.

dongbohu commented 6 years ago

By the way, @jaclyn-taroni and @dvenprasad: Do you think it would be helpful to add an extra Reference field to the protocol_info structure? We can set this field to the URL from which we retrieved the protocol information. With this field on the web UI, users will be able to find more details if they want.

jaclyn-taroni commented 6 years ago

I like that idea, @dongbohu, provided it is not too big a time investment at the moment.

dongbohu commented 6 years ago

@jaclyn-taroni 👍 The URLs are already saved either in experiment or sample models, so it is just a little bit of extra work.

jaclyn-taroni commented 6 years ago

Great, thanks @dongbohu

dongbohu commented 6 years ago

@arielsvn: In case you didn't see, after some discussions with @jaclyn-taroni (see our comments above), I tweaked the sample's protocol_info structure a little bit. The main difference is the addition of Reference key.

dongbohu commented 6 years ago

@arielsvn: I found some errors in the protocol_info structure for GEO samples above and I have updated it. The main difference is that most values are arrays of strings (instead of a single string).

arielsvn commented 6 years ago

Sounds good, @dongbohu. I have a question: is it possible to add another field that specifies the type of sample (GEO, ArrayExpress or SRA)? That way I'll know the structure of protocol_info.

dongbohu commented 6 years ago

@arielsvn: Doesn't source_database in Sample endpoint already have that information?
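A minimal Python sketch of that idea: branch on `source_database` to decide how to read `protocol_info`, with NA as the fallback for empty values. The literal values `"GEO"`, `"ARRAY_EXPRESS"` and `"SRA"` are assumptions about what the endpoint returns, and `summarize_protocol_info` is a hypothetical helper, so check both against the actual API.

```python
NA = "NA"


def summarize_protocol_info(sample):
    """Return display lines for a sample, based on its source database."""
    source = sample.get("source_database")
    info = sample.get("protocol_info")
    if not info:
        # Empty object or array: nothing to show.
        return [NA]

    if source == "GEO":
        # Dictionary of label -> list of strings, plus a Reference URL.
        lines = [
            f"{label}: {' '.join(values)}"
            for label, values in info.items()
            if label != "Reference" and values
        ]
    elif source == "ARRAY_EXPRESS":
        # List of dictionaries, one per protocol in the experiment.
        lines = [
            f"{p.get('Type', NA)}: {p.get('Text') or p.get('Description') or NA}"
            for p in info
        ]
    elif source == "SRA":
        # List of dictionaries, usually with a single element.
        lines = [p.get("Description") or NA for p in info]
    else:
        raise ValueError(f"Unknown source_database: {source!r}")
    return lines or [NA]


geo_sample = {
    "source_database": "GEO",
    "protocol_info": {"Scan protocol": ["a", "b"], "Label protocol": [], "Reference": "u"},
}
print(summarize_protocol_info(geo_sample))
```

Branching on the field avoids inspecting the shape of `protocol_info` to guess where a sample came from.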

arielsvn commented 6 years ago

You're right, I had missed that field. Thanks @dongbohu!

dongbohu commented 6 years ago

Backend implemented in this PR: https://github.com/AlexsLemonade/refinebio/pull/587
