AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Deliver and display selected sample annotations in table #515

Closed jaclyn-taroni closed 6 years ago

jaclyn-taroni commented 6 years ago

Context

Jackiecrunch (#466)

Problem or idea

We are performing some smoothing of the sample metadata and these harmonized values are the ones being displayed in the samples table view. However, we smooth a limited number of fields, so information that may be important to a user, such as the presence or absence of a particular disease manifestation, may not be available in this table. In addition, when we return sample annotations to users in the TSV metadata files, they are essentially JSON in a column. This will be of limited utility to users without considerable programming savvy.

Here's an example of a sample from GEO that has been imported into ArrayExpress:

"samples": {
        "IFNa DC_LB016_IFNa": {
            "accession_code": "E-GEOD-44719-GSM1089311",
            "age": "",
            "annotations": [
                {
                    "detected_platform": "illuminaHumanv3",
                    "detection_percentage": 98.44078,
                    "mapped_percentage": 100.0
                },
                {
                    "assay": {
                        "name": "GSM1089311"
                    },
                    "characteristic": [
                        {
                            "category": "cell population",
                            "value": "IFNa DC"
                        },
                        {
                            "category": "donor id",
                            "value": "LB016"
                        },
                        {
                            "category": "organism",
                            "value": "Homo sapiens"
                        },
                        {
                            "category": "stimulation",
                            "value": "IFNa"
                        },
                        {
                            "category": "tissue type",
                            "value": "whole blood"
                        }
                    ],
                    "extract": {
                        "name": "GSM1089311 extract 1"
                    },
                    "file": [
                        {
                            "comment": {
                                "name": "Derived ArrayExpress FTP file",
                                "value": "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/GEOD/E-GEOD-44719/E-GEOD-44719.processed.1.zip"
                            },
                            "name": "GSM1089311_sample_table.txt",
                            "type": "derived data",
                            "url": "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/GEOD/E-GEOD-44719/E-GEOD-44719.processed.1.zip/GSM1089311_sample_table.txt"
                        }
                    ],
                    "labeled-extract": {
                        "label": "biotin",
                        "name": "GSM1089311 LE 1"
                    },
                    "source": {
                        "comment": [
                            {
                                "name": "Sample_description",
                                "value": "4447846325_G"
                            },
                            {
                                "name": "Sample_source_name",
                                "value": "IFNa DC_IFNa"
                            },
                            {
                                "name": "Sample_title",
                                "value": "IFNa DC_LB016_IFNa"
                            }
                        ],
                        "name": "GSM1089311 1"
                    },
                    "variable": [
                        {
                            "name": "cell population",
                            "value": "IFNa DC"
                        },
                        {
                            "name": "donor id",
                            "value": "LB016"
                        },
                        {
                            "name": "dose",
                            "value": "1 mL"
                        },
                        {
                            "name": "stimulation",
                            "value": "IFNa"
                        }
                    ]
                }
            ],
            "cell_line": "",
            "compound": "",
            "disease": "",
            "disease_stage": "",
            "genotype": "",
            "organism": "HOMO_SAPIENS",
            "platform": "Illumina_HumanHT-12_V3.0 (Illumina_HumanHT-12_V3.0)",
            "race": "",
            "sample_id": "IFNa DC_LB016_IFNa",
            "sex": "",
            "source_archive_url": "https://www.ebi.ac.uk/arrayexpress/json/v3/experiments/E-GEOD-44719/samples",
            "specimen_part": "whole blood",
            "subject": "lb016",
            "time": "",
            "title": "IFNa DC_LB016_IFNa",
            "treatment": ""
        },
        ...

For this sample, I would like to see the information contained in variable and characteristic in table form. For example:

"variable": [
                        {
                            "name": "cell population",
                            "value": "IFNa DC"
                        }
            ]

cell population would be the header and the value would be IFNa DC. Note that some of the fields in characteristic and variable are redundant (pointed here), so retaining only one where there would be duplicates would be quite helpful to users.
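
As a rough illustration of the request above, here is a minimal sketch of turning characteristic and variable entries into column/value pairs while keeping only one copy of duplicated fields. The function name `annotation_columns` is hypothetical and not part of refinebio; the input shape is taken from the ArrayExpress example above.

```python
def annotation_columns(annotation):
    """Collect ArrayExpress `characteristic` and `variable` entries as column -> value."""
    columns = {}
    for entry in annotation.get("characteristic", []):
        columns[entry["category"]] = entry["value"]
    for entry in annotation.get("variable", []):
        # setdefault keeps the first value seen, so a field that appears in
        # both lists produces a single column.
        columns.setdefault(entry["name"], entry["value"])
    return columns


annotation = {
    "characteristic": [
        {"category": "cell population", "value": "IFNa DC"},
        {"category": "donor id", "value": "LB016"},
    ],
    "variable": [
        {"name": "cell population", "value": "IFNa DC"},
        {"name": "dose", "value": "1 mL"},
    ],
}
row = annotation_columns(annotation)
```

Here `row` has three keys (cell population, donor id, dose), each of which would become a column header in the table.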

Here's an example from GEO:

"samples": {
        "Bone.Marrow_OA_No_ST01": {
            "accession_code": "GSM1361046",
            "age": "",
            "annotations": [
                {
                    "channel_count": [
                        "1"
                    ],
                    "characteristics_ch1": [
                        "tissue: Bone Marrow",
                        "disease: OA",
                        "serum: Low Serum"
                    ],
                    "contact_address": [
                        "Crown Street"
                    ],
                    "contact_city": [
                        "Liverpool"
                    ],
                    "contact_country": [
                        "United Kingdom"
                    ],
                    "contact_email": [
                        "p.antczak@liv.ac.uk"
                    ],
                    "contact_institute": [
                        "University of Liverpool"
                    ],
                    "contact_name": [
                        "Philipp,,Antczak"
                    ],
                    "contact_state": [
                        "Merseyside"
                    ],
                    "contact_zip/postal_code": [
                        "L69 7ZB"
                    ],
                    "data_processing": [
                        "Data was processed and normalized using the RMA methodology. Probes that were called absent in 50% of samples have been removed."
                    ],
                    "data_row_count": [
                        "21742"
                    ],
                    "extract_protocol_ch1": [
                        "Fibroblasts were retrieved from culture by trypsin digestion and washed in fibroblast medium then PBS before resuspending in 200l of RLT RNA protection buffer (Quiagen) followed by the manufacturer instructions of the RNeasy extraction kit."
                    ],
                    "geo_accession": [
                        "GSM1361046"
                    ],
                    "growth_protocol_ch1": [
                        "Cells were seeded in duplicate at the required density using 10% FCS. Once attached, the medium was removed and the cells washed in serum free medium before replacing with fibroblast medium, which was identical but contained either 10% or 0.1% FCS."
                    ],
                    "hyb_protocol": [
                        "Hybridization of Chips was performed as per manufacturers instructions."
                    ],
                    "label_ch1": [
                        "biotin"
                    ],
                    "label_protocol_ch1": [
                        "50ng of total RNA was amplified using Nugen Ovation Biotin RNA amplification"
                    ],
                    "last_update_date": [
                        "Jun 22 2015"
                    ],
                    "molecule_ch1": [
                        "total RNA"
                    ],
                    "organism_ch1": [
                        "Homo sapiens"
                    ],
                    "platform_id": [
                        "GPL570"
                    ],
                    "scan_protocol": [
                        "SData were scanned as per manufactureres instructions on a GeneArray Scanner."
                    ],
                    "series_id": [
                        "GSE56409"
                    ],
                    "source_name_ch1": [
                        "Bone Marrow"
                    ],
                    "status": [
                        "Public on Jun 22 2015"
                    ],
                    "submission_date": [
                        "Apr 01 2014"
                    ],
                    "supplementary_file": [
                        "NONE"
                    ],
                    "taxid_ch1": [
                        "9606"
                    ],
                    "title": [
                        "Bone.Marrow_OA_No_ST01"
                    ],
                    "type": [
                        "RNA"
                    ]
                }
            ],
            "cell_line": "",
            "compound": "",
            "disease": "oa",
            "disease_stage": "",
            "genotype": "",
            "organism": "HOMO_SAPIENS",
            "platform": "Affymetrix Human Genome U133 Plus 2.0 Array (hgu133plus2)",
            "race": "",
            "sample_id": "Bone.Marrow_OA_No_ST01",
            "sex": "",
            "source_archive_url": "",
            "specimen_part": "bone marrow",
            "subject": "",
            "time": "",
            "title": "Bone.Marrow_OA_No_ST01",
            "treatment": ""
        }, 
        ...

For this example, I am interested in what is contained in the characteristics field:

  "characteristics_ch1": [
                        "tissue: Bone Marrow",
                        "disease: OA",
                        "serum: Low Serum"
                    ]

Where, ideally, tissue, disease, and serum would all be column names populated with their respective values. We capture the OA and bone marrow information in our smoothing, but the serum treatment (a key part of the experimental design) is not included.
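
A minimal sketch of how the "key: value" strings in characteristics_ch1 could be split into column names and values. `parse_characteristics` is a hypothetical helper, and the assumption that every useful entry contains a colon comes only from the example above.

```python
def parse_characteristics(entries):
    """Turn GEO-style "key: value" strings into a column -> value dict."""
    columns = {}
    for entry in entries:
        # Split on the first colon only, so values that contain ":" stay intact.
        key, sep, value = entry.partition(":")
        if not sep:
            continue  # not in "key: value" form; skip rather than guess
        columns[key.strip()] = value.strip()
    return columns


geo_columns = parse_characteristics(
    ["tissue: Bone Marrow", "disease: OA", "serum: Low Serum"]
)
```

With the example sample, `geo_columns` would carry the serum treatment as its own column alongside tissue and disease.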

I know there has been some hesitance to venture into this because the presence of nested values has been mentioned. At the moment, I am not sure why we are able to smooth some of the fields but not extract the others such that they are easily displayed or delivered in a tabular format, particularly if we are able to identify the specific fields (e.g., characteristic) that would need to be extracted.

I believe this effort would reduce friction for a large portion of our user base.

Solution or next step

~Can we run SRP056295, an experiment from SRA, and E-MEXP-31, ArrayExpress data that has not been imported from GEO, through our system? I do not currently have SRA or ArrayExpress submission metadata on hand, so I can not make a recommendation about what fields to extract yet. Some more examples from ArrayExpress and GEO would probably be helpful, too. These are the ones I happen to have on hand at the moment.~

See comment below about zebrafish experiments.

New Issue Checklist

jaclyn-taroni commented 6 years ago

Forgot to attach metadata_examples.zip

cgreene commented 6 years ago

I agree that we need to be able to add unsmoothed data to the table to provide our users with something that is useful to them. I am confused about why nested values cannot be exported. It seems like they could be as long as there was a determination of what character we would use to denote a hierarchy. In rare cases, this could conflict with characters used in the sample annotations, but that seems far better than making these data very difficult to access (i.e., leaving them as JSON only which - though I prefer it - most of our users will find inscrutable and unusable).
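
To make the "character to denote a hierarchy" idea concrete, here is a minimal sketch that flattens nested annotations into single-level keys. The `flatten` function and the choice of "." as the separator are assumptions for illustration; as noted above, any separator could collide with characters already used in annotation names.

```python
def flatten(value, prefix="", sep="."):
    """Flatten nested dicts/lists into a single-level dict with joined keys."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list positions become part of the key
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        path = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten(child, path, sep))
    return flat


nested = {
    "labeled-extract": {"label": "biotin", "name": "GSM1089311 LE 1"},
    "source": {
        "comment": [{"name": "Sample_title", "value": "IFNa DC_LB016_IFNa"}],
        "name": "GSM1089311 1",
    },
}
flat = flatten(nested)
```

Each key in `flat` (for example `source.comment.0.value`) could then serve as a TSV column header.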

wvauclain commented 6 years ago

It's important to note that whatever solution we have for this should probably also modify the work done in https://github.com/AlexsLemonade/refinebio/pull/484 to add the custom fields to the metadata_fields method so they can be displayed in the Sample Metadata box on the frontend.

jaclyn-taroni commented 6 years ago

Note to self: Here are some zebrafish experiments (#227) that it would be helpful to dig into in service of looking at what sample metadata is coming from the different sources:

ERP004809
ERP000447
SRP065208
SRP033369
SRP012376 
E-TABM-105
E-TABM-33
E-MTAB-2207
E-MEXP-2215
E-GEOD-41696
E-GEOD-63873
E-GEOD-39842
GSE94532
GSE52873
GSE4201

cgreene commented 6 years ago

I don't have enough information on what we are going to provide and how to provide it to write anything useful on https://github.com/AlexsLemonade/refinebio-docs/issues/3

dongbohu commented 6 years ago

@jaclyn-taroni: One question for you: Are the extra fields that you want to see on the front end static? In other words, for ArrayExpress samples, you always want to see the fields in the variable group (if this group is available in the database, of course), and for GEO samples, you always want to see the fields in characteristics_ch1?

And what about SRA samples?

jaclyn-taroni commented 6 years ago

@dongbohu let's focus on putting everything (all sample annotations) into tabular format. We don't have enough information to say if the extra fields are static at the moment. Would you prefer I file a separate "all sample annotations" issue assigned for this milestone?

dongbohu commented 6 years ago

You mean showing ALL sample annotation fields on web UI like this one: https://staging.refine.bio/experiments/2?ref=search or only the file that the user downloads or both? Showing them on the web UI could be distracting to users because a sample may have lots of annotation fields.

jaclyn-taroni commented 6 years ago

I would say both for now. Let's see if @dvenprasad agrees with that. I agree with you that it is not ideal and could be distracting. Hopefully in the near future we'll be able to identify static extra fields for display, but I'm not sure if that's possible yet.

jaclyn-taroni commented 6 years ago

I'll add that the plan would be to display this information in columns to the right of the processing information column (related: AlexsLemonade/refinebio-frontend#215) if I recall correctly.

dvenprasad commented 6 years ago

Displaying all the unnormalized fields after processing information was part of the original plan. It seems like the data is too unwieldy to do that in its current state.

Based on the conversation on https://github.com/AlexsLemonade/refinebio-frontend/issues/215, I understand that the json is present on a per sample basis. Would it be possible/make sense to display that on a modal on a per sample basis instead? We can have a column at the end 'Additional Metadata'

Also, I poked around a few metadata.json files, it looks like contact info repeats for every sample, would it be possible to "remove" it for display purposes? They can still be part of the download file.
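
A minimal sketch of hiding the repeated contact fields for display only, leaving the stored annotation untouched. `without_contact_fields` is a hypothetical helper, and relying on the `contact_` key prefix is an assumption based on the GEO example earlier in this issue.

```python
def without_contact_fields(annotation):
    """Return a copy of a GEO annotation without its contact_* keys."""
    return {
        key: value
        for key, value in annotation.items()
        if not key.startswith("contact_")
    }


annotation = {
    "characteristics_ch1": ["tissue: Bone Marrow"],
    "contact_city": ["Liverpool"],
    "contact_country": ["United Kingdom"],
    "geo_accession": ["GSM1361046"],
}
display_annotation = without_contact_fields(annotation)
```

Because the function builds a new dict, the original annotation can still be written to the download file in full.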

dongbohu commented 6 years ago

@dvenprasad So you mean on the samples table like the one on this page: https://staging.refine.bio/experiments/2?ref=search Add a new column Additional Metadata on the right side of the table, and the value in the cell is a link, clicking which will launch a modal that shows all annotation data in JSON format?

About the duplicate values in contact info field: this field is not always shown on each table of samples. After checking the code, I found that the fields shown in a samples table are determined dynamically by how many values the field has. Is it a decision that you guys made some time ago?

jaclyn-taroni commented 6 years ago

Provided this is not particularly tricky to accomplish @dongbohu, I like your + @dvenprasad's idea

Add a new column Additional Metadata on the right side of the table, and the value in the cell is a link, clicking which will launch a modal that shows all annotation data in JSON format?

For this question

After checking the code, I found that the fields shown in a samples table are determined dynamically by how many values the field has. Is it a decision that you guys made some time ago?

Yes, I believe that's what was discussed here https://github.com/AlexsLemonade/refinebio-frontend/issues/23#issuecomment-393302301. I'm surprised that contact info is ever shown on the sample table.

As far as this suggestion by @dvenprasad goes:

Also, I poked around a few metadata.json files, it looks like contact info repeats for every sample, would it be possible to "remove" it for display purposes? They can still be part of the download file.

I believe this is in reference to the Additional Metadata modal, rather than the sample table. For this release, we might consider hiding the contact fields to be of the "nice to have" rather than "must have" priority. What do you all think?

cgreene commented 6 years ago

For this release, we might consider hiding the contact fields to be of the "nice to have" rather than "must have" priority. What do you all think?

👍

jaclyn-taroni commented 6 years ago

I've filed a new ticket for the current milestone (#532)

dongbohu commented 6 years ago

@jaclyn-taroni: Attached are two tsv files generated based on the following abridged metadata:

        {
            'experiments': {
                "E-GEOD-44719": {
                    "accession_code": "E-GEOD-44719",
                    "sample_titles": [ "IFNa DC_LB016_IFNa" ]
                }
            },

            'samples': {
                "IFNa DC_LB016_IFNa": {  # Sample #1 is an ArrayExpress sample
                    "accession_code": "E-GEOD-44719-GSM1089311",
                    "source_database": "ARRAY_EXPRESS",
                    "annotations": [
                        # annotation #1
                        {
                            "detected_platform": "illuminaHumanv3",
                            "detection_percentage": 98.44078,
                            "mapped_percentage": 100.0
                        },
                        # annotation #2
                        {
                            "assay": { "name": "GSM1089311" },

                            # Special field that will be taken out as separate columns
                            "characteristic": [
                                { "category": "cell population",
                                  "value": "IFNa DC"
                                },
                                { "category": "donor id",
                                  "value": "LB016"
                                }
                            ],

                            # Another special field in Array Express sample
                            "variable": [
                                { "name": "dose",
                                  "value": "1 mL"
                                },
                                { "name": "stimulation",
                                  "value": "IFNa"
                                }
                            ],

                            "extract": { "name": "GSM1089311 extract 1" }
                        }
                    ]  # end of annotations
                },  # end of sample #1

                "Bone.Marrow_OA_No_ST03": {  # Sample #2 is a GEO sample
                    "accession_code": "GSM1361050",
                    "annotations": [
                        {
                            "channel_count": [ "1" ],

                            # Special field that will be taken out as separate columns
                            "characteristics_ch1": [
                                "tissue: Bone Marrow",
                                "disease: OA",
                                "serum: Low Serum"
                            ],

                            "contact_address": [ "Crown Street" ],
                            "contact_country": [ "United Kingdom" ],
                            "data_processing": [ "Data was processed and normalized" ],
                            "geo_accession": [ "GSM1361050" ],
                        }
                    ],  # end of annotations

                    "organism": "HOMO_SAPIENS",
                    "source_database": "GEO"
                }  # end of sample #2

            }  # end of "samples"
        }

metadata.tsv.txt E-GEOD-44719_metadata.tsv.txt (Both files were TSV files, but I had to change their suffixes to attach to github. You can remove .txt suffix after downloading them.)

metadata.tsv.txt includes both samples, E-GEOD-44719_metadata.tsv.txt includes only one sample, because it was aggregated by the experiment.

Can you confirm that they are in the format that you expect?

dongbohu commented 6 years ago

A few points that you may or may not like:

jaclyn-taroni commented 6 years ago

@dongbohu the tsv files are mostly as I would expect. The only things I find kind of odd are the brackets in the assay and extract fields (e.g., {'name': 'GSM1089311'}).

Thanks for the detailed points!

  • The columns in both files are sorted in alphabetic order.
  • I added source_database column to make the implementation easier.

These changes both sound good to me.
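
For reference, a minimal sketch of writing rows with differing fields to a TSV whose columns are sorted alphabetically. It uses only the standard `csv` module; `write_tsv` and the sample rows are illustrative and not taken from the actual implementation.

```python
import csv
import io


def write_tsv(rows, handle):
    """Write dict rows as TSV with alphabetically sorted columns."""
    # The header is the union of all keys, so samples with different
    # annotation fields can share one file.
    columns = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(
        handle, fieldnames=columns, delimiter="\t", restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)  # fields missing from a row are left empty


buffer = io.StringIO()
write_tsv(
    [
        {"source_database": "ARRAY_EXPRESS", "dose": "1 mL"},
        {"source_database": "GEO", "serum": "Low Serum"},
    ],
    buffer,
)
lines = buffer.getvalue().splitlines()
```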

  • For ArrayExpress samples, if a field is available in both characteristic and variable inside annotation, the value will be silently overwritten. In other words, I am assuming that their values are always identical.

I am less happy about this one 😉 -- I will note it in the docs (https://github.com/AlexsLemonade/refinebio-docs/issues/4#issuecomment-421376590), though, as we already have caveats around the metadata. Is there a way to throw a warning or something? I'm specifically concerned with having a way to track how often this happens (if at all).

dongbohu commented 6 years ago

@jaclyn-taroni: As you see in the sample metadata, the values of the assay and extract fields are both object literals:

"assay": { "name": "GSM1089311" },
"extract": { "name": "GSM1089311 extract 1" }

If we are sure that the values of these two fields always include only one key/value pair and the key is always name, I can convert these two values into GSM1089311 and GSM1089311 extract 1.
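
A minimal sketch of that conversion, which only unwraps a value when it is a dict whose single key is name and otherwise leaves it unchanged. `unwrap_name` is a hypothetical helper name.

```python
def unwrap_name(value):
    """Replace {"name": x} with x; leave any other value as it is."""
    if isinstance(value, dict) and set(value) == {"name"}:
        return value["name"]
    return value


assay = unwrap_name({"name": "GSM1089311"})
labeled_extract = unwrap_name({"name": "GSM1089311 LE 1", "label": "biotin"})
```

Checking the key set first means objects with additional keys, such as labeled-extract, are not silently truncated.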

About the value conflicts between characteristic and variable in ArrayExpress samples, I can print out some warning message when a conflict is detected.
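
A minimal sketch of such a warning, using the standard `logging` module. `merge_with_warning` is hypothetical, and letting the variable value win is an assumption; the thread does not say which side overwrites the other.

```python
import logging

logger = logging.getLogger(__name__)


def merge_with_warning(characteristic, variable, accession_code):
    """Merge both lists into one dict and report fields whose values differ."""
    merged = {entry["category"]: entry["value"] for entry in characteristic}
    conflicts = []
    for entry in variable:
        name, value = entry["name"], entry["value"]
        if name in merged and merged[name] != value:
            conflicts.append(name)
            logger.warning(
                "Sample %s: %r differs between characteristic (%r) and variable (%r)",
                accession_code, name, merged[name], value,
            )
        merged[name] = value  # assumption: the variable value overwrites
    return merged, conflicts


merged, conflicts = merge_with_warning(
    [{"category": "dose", "value": "1 mL"}],
    [{"name": "dose", "value": "2 mL"}, {"name": "stimulation", "value": "IFNa"}],
    "E-GEOD-44719-GSM1089311",
)
```

Returning the list of conflicting field names alongside the merged dict would make it possible to count how often this happens.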

I am afraid we cannot avoid { ... } pairs in the TSV file. For example, consider the following ArrayExpress annotation fields:

'labeled-extract': {
    'name': 'GSM1288968 LE 1', 
    'label': 'biotin'
},

'source': {
    'name': 'GSM1288968 1',
    'comment': [
        { 
            'name': 'Sample_source_name',
            'value': 'pineal glands at CT18, after light exposure'
        },
        { 
            'name': 'Sample_title',
            'value': 'Pineal_Light_CT18'
        }
    ]
},
...

How can we "flatten" them completely?

dongbohu commented 6 years ago

Another problem that I just found out: Given the following ArrayExpress sample's annotation field:

        'characteristic': [
            { 'category': 'age',
              'value': 'adult (0.5-1.5 years old)'
            },
            { 'category': 'organism',
              'value': 'Danio rerio'
            },
            ...
            { 'category': 'sex',
              'value': 'males and females'
            },
            ...
        ],

age, organism and sex will be taken out as separate columns in the TSV file. But the sample may already have its own harmonized values of age, organism, sex, etc. So these annotation values probably should never overwrite the harmonized values.

In other words, if an annotation field name is identical to a sample's own attribute, that annotation field should be ignored. Does this make sense?

dongbohu commented 6 years ago

Another idea to avoid the field name conflicts is to modify the field names in annotations, for example, change age in 'characteristic' to something like annotation_characteristic_age, etc.

jaclyn-taroni commented 6 years ago

Chatted with @cgreene - let's prefix the harmonized values with refinebio_, so the age in characteristic would be age and our harmonized values would be in refinebio_age

dongbohu commented 6 years ago

The harmonized values are all the fields on lines 129-145: https://github.com/AlexsLemonade/refinebio/blob/dev/common/data_refinery_common/models/models.py#L127 right?

jaclyn-taroni commented 6 years ago

These are the ones I'm thinking about specifically

metadata['sex'] = self.sex
metadata['age'] = self.age or ''
metadata['specimen_part'] = self.specimen_part
metadata['genotype'] = self.genotype
metadata['disease'] = self.disease
metadata['disease_stage'] = self.disease_stage
metadata['cell_line'] = self.cell_line
metadata['treatment'] = self.treatment
metadata['race'] = self.race
metadata['subject'] = self.subject
metadata['compound'] = self.compound
metadata['time'] = self.time

(Correct @Miserlou ?)

I'll also note that genotype should be genetic information, see: https://github.com/AlexsLemonade/refinebio/pull/252#discussion_r188424351 and that's what is in our documentation.

Would it be possible to have you correct this while you're working on this issue @dongbohu ?

EDIT: re: the genetic information label -- we should carefully consider how to proceed given the milestone. Using genotype is not ideal, but I don't want to hold anything up.

dongbohu commented 6 years ago

@jaclyn-taroni:

jaclyn-taroni commented 6 years ago

@dongbohu Okay, both of those sound like a plan 👍

Miserlou commented 6 years ago

This is a super long thread to be summoned in, give me a sec..

Having both age and refinebio_age as direct fields on the same model seems like a bad idea to me, and leads down a bad path of solving problems by throwing more database fields at them. My gut says the better solution is just to have a more robust serializer/renderer for JSON annotations. (I really think a lot of this boils down to the fact that TSV is a really bad format for dealing with data with varying structures.)

We definitely need to check genotype/genetic_information everywhere; I think that was a later change.

dongbohu commented 6 years ago

@Miserlou: @jaclyn-taroni and I are not talking about creating new fields such as refinebio_age in the Sample model. Instead, we were talking about changing line 134 from:

metadata['age'] = self.age or ''

to:

metadata['refinebio_age'] = self.age or ''

so that the TSV file generated by smasher will have one column refinebio_age and possibly another column age (if it is found in annotation).
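
A minimal sketch of how a TSV row could combine both kinds of columns under this naming scheme. The field list is the one quoted earlier in this thread; `build_row` and its arguments are hypothetical and do not reflect the actual smasher code.

```python
HARMONIZED_FIELDS = (
    "sex", "age", "specimen_part", "genotype", "disease", "disease_stage",
    "cell_line", "treatment", "race", "subject", "compound", "time",
)


def build_row(harmonized, annotation_columns):
    """Combine source annotation columns with refinebio_-prefixed harmonized values."""
    row = dict(annotation_columns)  # e.g. "age" as submitted to the source archive
    for field in HARMONIZED_FIELDS:
        # The prefix keeps harmonized values from colliding with annotation names.
        row[f"refinebio_{field}"] = harmonized.get(field) or ""
    return row


row = build_row(
    {"sex": "female", "age": ""},
    {"age": "adult (0.5-1.5 years old)"},
)
```

In this example the row keeps the submitter's age string in age, while refinebio_age stays empty because no harmonized value exists.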

Considering the deadline, I think it makes sense to rename the TSV file column to genetic_information but keep the model's field as is for now. Of course we should file a new issue to keep track of it.

dongbohu commented 6 years ago

@jaclyn-taroni: I checked the current implementation of the ArrayExpress sample harmonize(). Here is what I found out:

The harmonized metadata fields of an ArrayExpress sample are based on the sample's SDRF file. Here is an example: https://www.ebi.ac.uk/arrayexpress/files/E-GEOD-44719/E-GEOD-44719.sdrf.txt (a TSV file). Its first four columns:

Source Name
Comment [Sample_description]    
Comment [Sample_source_name]    
Comment [Sample_title]

correspond to the source object in this sample annotation: https://www.ebi.ac.uk/arrayexpress/json/v3/experiments/E-GEOD-44719/samples (each sample's annotation data is an entry nested in experiment --> sample array)

During harmonization, Comment [Sample_title] is saved as the sample's title; the other three fields are discarded. The question is: do we want to keep the other information in source, or can we ignore it?

jaclyn-taroni commented 6 years ago

@dongbohu my gut feeling is to keep the information but not worry about including it in the TSV -- would that work?

dongbohu commented 6 years ago

@jaclyn-taroni: You mean keeping source in the sample annotations (so that it will be available in the JSON file) but skipping it in the TSV file?

jaclyn-taroni commented 6 years ago

@dongbohu yes, in that particular example the sample description field is important if going back to the processed data in the source repository (e.g., ArrayExpress), but it seems like a thing that only a select, more computational group of users would want to know (= JSON is okay)

dongbohu commented 6 years ago

@jaclyn-taroni: Sounds good to me!

dongbohu commented 6 years ago

@jaclyn-taroni: I realized that sample_id column in the TSV file is redundant, because its value is always identical to refinebio_title. Can we remove it?

jaclyn-taroni commented 6 years ago

Sure. We’ll find out if our users find that terminology to be confusing.

dongbohu commented 6 years ago

Closed by PR #629.