Closed jaclyn-taroni closed 6 years ago
Forgot to attach metadata_examples.zip
I agree that we need to be able to add unsmoothed data to the table to provide our users with something that is useful to them. I am confused about why nested values cannot be exported. It seems like they could be as long as there was a determination of what character we would use to denote a hierarchy. In rare cases, this could conflict with characters used in the sample annotations, but that seems far better than making these data very difficult to access (i.e., leaving them as JSON only which - though I prefer it - most of our users will find inscrutable and unusable).
It's important to note that whatever solution we have for this should probably also modify the work done in https://github.com/AlexsLemonade/refinebio/pull/484 to add the custom fields to the `metadata_fields` method so they can be displayed in the `Sample Metadata` box on the frontend.
Note to self: Here are some zebrafish experiments (#227) that it would be helpful to dig into in service of looking at what sample metadata is coming from the different sources:
ERP004809
ERP000447
SRP065208
SRP033369
SRP012376
E-TABM-105
E-TABM-33
E-MTAB-2207
E-MEXP-2215
E-GEOD-41696
E-GEOD-63873
E-GEOD-39842
GSE94532
GSE52873
GSE4201
I don't have enough information on what we are going to provide and how to provide it to write anything useful on https://github.com/AlexsLemonade/refinebio-docs/issues/3
@jaclyn-taroni: One question for you: Are the extra fields that you want to see on the front end static? In other words, for ArrayExpress samples, you always want to see the fields in `variable` group (if this group is available in the database of course), and for GEO samples, you always want to see the fields in `characteristics_ch1`?
And what about SRA samples?
@dongbohu let's focus on putting everything (all sample annotations) into tabular format. We don't have enough information to say if the extra fields are static at the moment. Would you prefer I file a separate "all sample annotations" issue assigned for this milestone?
You mean showing ALL sample annotation fields on web UI like this one: https://staging.refine.bio/experiments/2?ref=search or only the file that the user downloads or both? Showing them on the web UI could be distracting to users because a sample may have lots of annotation fields.
I would say both for now. Let's see if @dvenprasad agrees with that. I agree with you that it is not ideal and could be distracting. Hopefully in the near future we'll be able to identify static extra fields for display, but I'm not sure if that's possible yet.
I'll add that the plan would be to display this information in columns to the right of the processing information column (related: AlexsLemonade/refinebio-frontend#215) if I recall correctly.
Displaying all the unnormalized fields after processing information was part of the original plan. It seems like the data is too unwieldy to do that in its current state.
Based on the conversation on https://github.com/AlexsLemonade/refinebio-frontend/issues/215, I understand that the json is present on a per sample basis. Would it be possible/make sense to display that on a modal on a per sample basis instead? We can have a column at the end 'Additional Metadata'
Also, I poked around a few metadata.json files, it looks like contact info repeats for every sample, would it be possible to "remove" it for display purposes? They can still be part of the download file.
@dvenprasad So you mean on the samples table like the one on this page:
https://staging.refine.bio/experiments/2?ref=search
Add a new column `Additional Metadata` on the right side of the table, and the value in the cell is a link, clicking which will launch a modal that shows all annotation data in JSON format?
About the duplicate values in `contact info` field: this field is not always shown on each table of samples. After checking the code, I found that the fields shown in a samples table are determined dynamically by how many values the field has. Is it a decision that you guys made some time ago?
Provided this is not particularly tricky to accomplish @dongbohu, I like your + @dvenprasad's idea
> Add a new column Additional Metadata on the right side of the table, and the value in the cell is a link, clicking which will launch a modal that shows all annotation data in JSON format?

For this question
> After checking the code, I found that the fields shown in a samples table are determined dynamically by how many values the field has. Is it a decision that you guys made some time ago?

Yes, I believe that's what was discussed here https://github.com/AlexsLemonade/refinebio-frontend/issues/23#issuecomment-393302301. I'm surprised that `contact info` is ever shown on the sample table.
As far as this suggestion by @dvenprasad goes:
> Also, I poked around a few metadata.json files, it looks like contact info repeats for every sample, would it be possible to "remove" it for display purposes? They can still be part of the download file.

I believe this is in reference to the Additional Metadata modal, rather than the sample table. For this release, we might consider hiding the contact fields to be of the "nice to have" rather than "must have" priority. What do you all think?
> For this release, we might consider hiding the contact fields to be of the "nice to have" rather than "must have" priority. What do you all think?

👍
I've filed a new ticket for the current milestone (#532)
@jaclyn-taroni: Attached are two tsv files generated based on the following abridged metadata:
{
    'experiments': {
        "E-GEOD-44719": {
            "accession_code": "E-GEOD-44719",
            "sample_titles": [ "IFNa DC_LB016_IFNa" ]
        }
    },
    'samples': {
        "IFNa DC_LB016_IFNa": {  # Sample #1 is an ArrayExpress sample
            "accession_code": "E-GEOD-44719-GSM1089311",
            "source_database": "ARRAY_EXPRESS",
            "annotations": [
                # annotation #1
                {
                    "detected_platform": "illuminaHumanv3",
                    "detection_percentage": 98.44078,
                    "mapped_percentage": 100.0
                },
                # annotation #2
                {
                    "assay": { "name": "GSM1089311" },
                    # Special field that will be taken out as separate columns
                    "characteristic": [
                        { "category": "cell population",
                          "value": "IFNa DC"
                        },
                        { "category": "donor id",
                          "value": "LB016"
                        }
                    ],
                    # Another special field in Array Express sample
                    "variable": [
                        { "name": "dose",
                          "value": "1 mL"
                        },
                        { "name": "stimulation",
                          "value": "IFNa"
                        }
                    ],
                    "extract": { "name": "GSM1089311 extract 1" }
                }
            ]  # end of annotations
        },  # end of sample #1
        "Bone.Marrow_OA_No_ST03": {  # Sample #2 is a GEO sample
            "accession_code": "GSM1361050",
            "annotations": [
                {
                    "channel_count": [ "1" ],
                    # Special field that will be taken out as separate columns
                    "characteristics_ch1": [
                        "tissue: Bone Marrow",
                        "disease: OA",
                        "serum: Low Serum"
                    ],
                    "contact_address": [ "Crown Street" ],
                    "contact_country": [ "United Kingdom" ],
                    "data_processing": [ "Data was processed and normalized" ],
                    "geo_accession": [ "GSM1361050" ],
                }
            ],  # end of annotations
            "organism": "HOMO_SAPIENS",
            "source_database": "GEO"
        }  # end of sample #2
    }  # end of "samples"
}
metadata.tsv.txt
E-GEOD-44719_metadata.tsv.txt
(Both files are TSV files, but I had to change their suffixes to attach them to GitHub. You can remove the `.txt` suffix after downloading them.)
`metadata.tsv.txt` includes both samples; `E-GEOD-44719_metadata.tsv.txt` includes only one sample, because it was aggregated by the experiment.
Can you confirm that they are in the format that you expect?
A few points that you may or may not like:
- The columns in both files are sorted in alphabetic order.
- I added `source_database` column to make the implementation easier.
- For ArrayExpress samples, if a field is available in both `characteristic` and `variable` inside annotation, the value will be silently overwritten. In other words, I am assuming that their values are always identical.

@dongbohu the tsv files are mostly as I would expect. The only things I find kind of odd are the brackets in the `assay` and `extract` fields (e.g., `{'name': 'GSM1089311'}`).
Thanks for the detailed points!
> - The columns in both files are sorted in alphabetic order.
> - I added `source_database` column to make the implementation easier.

These changes both sound good to me.
> - For ArrayExpress samples, if a field is available in both characteristic and variable inside annotation, the value will be silently overwritten. In other words, I am assuming that their values are always identical.

I am less happy about this one 😉 -- I will note it in the docs (https://github.com/AlexsLemonade/refinebio-docs/issues/4#issuecomment-421376590), though, as we already have caveats around the metadata. Is there a way to throw a warning or something? I'm specifically concerned with having a way to track how often this happens (if at all).
@jaclyn-taroni: As you see in the sample metadata, the values of the `assay` and `extract` fields are both object literals:
"assay": { "name": "GSM1089311" },
"extract": { "name": "GSM1089311 extract 1" }
If we are sure that the values of these two fields always include only one key/value pair and the key is always `name`, I can convert these two values into `GSM1089311` and `GSM1089311 extract 1`.
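As a rough illustration of the conversion described above (a sketch only, not the refine.bio implementation; `unwrap_name` is a hypothetical helper), a value could be unwrapped only when `name` is its sole key and left untouched otherwise:

```python
def unwrap_name(value):
    """Return value["name"] when value is a dict whose only key is "name".

    Any other value (extra keys, lists, plain strings) is returned unchanged,
    so nothing is dropped when the single-key assumption does not hold.
    """
    if isinstance(value, dict) and set(value) == {"name"}:
        return value["name"]
    return value


assay = unwrap_name({"name": "GSM1089311"})
extract = unwrap_name({"name": "GSM1089311 extract 1"})
# A value with a second key does not match the assumption and stays a dict.
labeled = unwrap_name({"name": "GSM1288968 LE 1", "label": "biotin"})
```

Checking the key set explicitly keeps the behavior safe if the "always one key" assumption turns out to be wrong for some samples.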
About the value conflicts between `characteristic` and `variable` in ArrayExpress samples, I can print out some warning message when a conflict is detected.
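One way the warning could look, as a minimal sketch: the function name and logger setup are hypothetical, and letting `variable` win over `characteristic` is an assumption, since the thread does not say which value overwrites which.

```python
import logging

logger = logging.getLogger(__name__)


def merge_characteristic_and_variable(annotation):
    """Turn an ArrayExpress annotation's `characteristic` and `variable`
    lists into one {column_name: value} dict, warning on conflicts."""
    columns = {}
    for entry in annotation.get("characteristic", []):
        columns[entry["category"]] = entry["value"]
    for entry in annotation.get("variable", []):
        name, value = entry["name"], entry["value"]
        if name in columns and columns[name] != value:
            # Log instead of overwriting silently, so conflicts can be counted.
            logger.warning(
                "Conflicting values for %r: characteristic=%r, variable=%r",
                name, columns[name], value,
            )
        columns[name] = value  # assumption: `variable` takes precedence
    return columns


annotation = {
    "characteristic": [
        {"category": "cell population", "value": "IFNa DC"},
        {"category": "donor id", "value": "LB016"},
    ],
    "variable": [
        {"name": "dose", "value": "1 mL"},
        {"name": "stimulation", "value": "IFNa"},
    ],
}
columns = merge_characteristic_and_variable(annotation)
```

Counting the emitted warnings in the logs would give a way to track how often conflicts actually happen.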
I am afraid we cannot avoid `{ ... }` pairs in the TSV file. For example, in the following ArrayExpress annotation fields:
'labeled-extract': {
    'name': 'GSM1288968 LE 1',
    'label': 'biotin'
},
'source': {
    'name': 'GSM1288968 1',
    'comment': [
        {
            'name': 'Sample_source_name',
            'value': 'pineal glands at CT18, after light exposure'
        },
        {
            'name': 'Sample_title',
            'value': 'Pineal_Light_CT18'
        }
    ]
},
...
How can we "flatten" them completely?
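For what it's worth, one generic option is to join nested keys with a delimiter, as suggested earlier in this thread. The sketch below is only an illustration: the `flatten` helper, the `.` separator, and the use of list indexes in column names are all assumptions, not anything refine.bio does.

```python
def flatten(value, prefix="", sep="."):
    """Recursively flatten nested dicts/lists into {"a.b.0.c": leaf} pairs.

    Dict keys and list indexes are joined with `sep`. Empty dicts and lists
    produce no columns. `sep` could collide with characters already used in
    annotation keys, which is the caveat raised earlier in the thread.
    """
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}{sep}{key}" if prefix else str(key)
            flat.update(flatten(child, child_prefix, sep))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            child_prefix = f"{prefix}{sep}{index}" if prefix else str(index)
            flat.update(flatten(child, child_prefix, sep))
    else:
        flat[prefix] = value
    return flat


annotation = {
    "labeled-extract": {"name": "GSM1288968 LE 1", "label": "biotin"},
    "source": {
        "name": "GSM1288968 1",
        "comment": [
            {"name": "Sample_source_name",
             "value": "pineal glands at CT18, after light exposure"},
            {"name": "Sample_title", "value": "Pineal_Light_CT18"},
        ],
    },
}
flat = flatten(annotation)
# flat["source.comment.1.value"] is "Pineal_Light_CT18"
```

The trade-off is that the number of columns depends on the data (one per list element), so different experiments would end up with different headers.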
Another problem that I just found out: Given the following ArrayExpress sample's annotation field:
'characteristic': [
    { 'category': 'age',
      'value': 'adult (0.5-1.5 years old)'
    },
    { 'category': 'organism',
      'value': 'Danio rerio'
    },
    ...
    { 'category': 'sex',
      'value': 'males and females'
    },
    ...
],
`age`, `organism` and `sex` will be taken out as separate columns in the TSV file. But the sample may already have its own harmonized values of `age`, `organism`, `sex`, etc. So these annotation values probably should never overwrite the harmonized values.
In other words, if an annotation field name is identical to a sample's own attribute, that annotation field should be ignored. Does this make sense?
Another idea to avoid the field name conflicts is to modify the field names in annotations, for example, change `age` in `characteristic` to something like `annotation_characteristic_age`, etc.
Chatted with @cgreene - let's prefix the harmonized values with `refinebio_`, so the age in `characteristic` would be `age` and our harmonized values would be in `refinebio_age`.
The harmonized values are all the fields on lines 129-145: https://github.com/AlexsLemonade/refinebio/blob/dev/common/data_refinery_common/models/models.py#L127 right?
These are the ones I'm thinking about specifically
metadata['sex'] = self.sex
metadata['age'] = self.age or ''
metadata['specimen_part'] = self.specimen_part
metadata['genotype'] = self.genotype
metadata['disease'] = self.disease
metadata['disease_stage'] = self.disease_stage
metadata['cell_line'] = self.cell_line
metadata['treatment'] = self.treatment
metadata['race'] = self.race
metadata['subject'] = self.subject
metadata['compound'] = self.compound
metadata['time'] = self.time
(Correct @Miserlou ?)
I'll also note that `genotype` should be `genetic information`, see: https://github.com/AlexsLemonade/refinebio/pull/252#discussion_r188424351 and that's what is in our documentation.
Would it be possible to have you correct this while you're working on this issue @dongbohu ?
EDIT: re: the `genetic information` label -- we should carefully consider how to proceed given the milestone. Using `genotype` is not ideal, but I don't want to hold anything up.
@jaclyn-taroni:
- `refinebio_` prefix to all the fields listed on lines 129-145? Since we can never predict what fields are available in annotations, adding this prefix will solve the field name conflicts for good.
- `genotype` to `genetic_information`? All the other field names are using underscore as a delimiter.

@dongbohu Okay both of those sound like a plan 👍
This is a super long thread to be summoned in, give me a sec..
Having both `age` and `refinebio_age` as direct fields on the same model seems like a bad idea to me, and leads down a bad path of solving problems by throwing more database fields at them. My gut says the better solution is just to have a more robust serializer/renderer for JSON annotations. (I really think a lot of this boils down to the fact that TSV is a really bad format for dealing with data with varying structures.)
We definitely need to check that `genotype`/`genetic_information` everywhere, I think that was a later change.
@Miserlou: @jaclyn-taroni and I are not talking about creating new fields such as `refinebio_age` in the `Sample` model. Instead, we were talking about changing line 134 from:
metadata['age'] = self.age or ''
to:
metadata['refinebio_age'] = self.age or ''
so that the TSV file generated by smasher will have one column `refinebio_age` and possibly another column `age` (if it is found in annotation).
Considering the deadline, I think it makes sense to rename the TSV file column to `genetic_information` but keep the model's field as is for now. Of course we should file a new issue to keep track of it.
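To make the column layout concrete, here is a small sketch of how prefixed harmonized values and raw annotation columns could sit side by side in one TSV row. `build_row` and `HARMONIZED_FIELDS` are hypothetical names; the field list is copied from the `metadata[...]` lines quoted above, and applying `or ''` to every field (not only `age`) is an assumption made for brevity.

```python
# Harmonized attributes, as listed in the metadata[...] lines quoted above.
HARMONIZED_FIELDS = (
    "sex", "age", "specimen_part", "genotype", "disease", "disease_stage",
    "cell_line", "treatment", "race", "subject", "compound", "time",
)


def build_row(harmonized, annotation_columns):
    """Combine annotation-derived columns with prefixed harmonized values.

    Annotation columns keep their original names (e.g. "age"); harmonized
    values go under "refinebio_<field>", so the two can never collide.
    """
    row = dict(annotation_columns)
    for field in HARMONIZED_FIELDS:
        # Assumption: missing or empty harmonized values become "".
        row[f"refinebio_{field}"] = harmonized.get(field) or ""
    return row


row = build_row(
    harmonized={"age": 1.0, "sex": "male"},
    annotation_columns={"age": "adult (0.5-1.5 years old)",
                        "sex": "males and females"},
)
# row["age"] holds the submitter's text; row["refinebio_age"] holds 1.0.
```

Nothing changes on the model in this sketch; only the keys used when building the TSV row differ.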
@jaclyn-taroni: I checked the current implementation of the ArrayExpress sample `harmonize()`. Here is what I found out:
The harmonized metadata fields of an ArrayExpress sample are based on the sample's SDRF file; here is an example: https://www.ebi.ac.uk/arrayexpress/files/E-GEOD-44719/E-GEOD-44719.sdrf.txt This is a TSV file. Its first four columns:
Source Name
Comment [Sample_description]
Comment [Sample_source_name]
Comment [Sample_title]
correspond to the `source` object in this sample annotation:
https://www.ebi.ac.uk/arrayexpress/json/v3/experiments/E-GEOD-44719/samples
(each sample's annotation data is an entry nested in the `experiment --> sample` array)
During harmonization, `Comment [Sample_title]` is saved as the sample's title; the other three fields are discarded. The question is, do we want to keep the other information in `source`, or can we ignore it?
@dongbohu my gut feeling is to keep the information but not worry about including it in the TSV -- would that work?
@jaclyn-taroni: You mean keeping `source` in sample annotations (so that it will be available in the JSON file) but skipping it in the TSV file?
@dongbohu yes, in that particular example the sample description field is important if going back to the processed data in the source repository (e.g., ArrayExpress), but it seems like a thing that only a select, more computational group of users would want to know (= JSON is okay)
@jaclyn-taroni: Sounds good to me!
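A sketch of what "keep it in JSON, skip it in TSV" could look like when writing the file, using only the standard library. The `rows_to_tsv` helper and the `TSV_SKIPPED_FIELDS` set are hypothetical; the alphabetic column order follows the convention mentioned earlier in the thread.

```python
import csv
import io

# Hypothetical: annotation fields kept in the JSON but left out of the TSV.
TSV_SKIPPED_FIELDS = {"source"}


def rows_to_tsv(rows):
    """Serialize a list of per-sample dicts to TSV text.

    Columns are the alphabetically sorted union of all keys; samples that
    lack a column get an empty cell.
    """
    kept = [
        {key: value for key, value in row.items()
         if key not in TSV_SKIPPED_FIELDS}
        for row in rows
    ]
    columns = sorted({key for row in kept for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    return buffer.getvalue()


tsv_text = rows_to_tsv([
    {"refinebio_title": "a", "dose": "1 mL", "source": {"name": "x"}},
    {"refinebio_title": "b", "tissue": "Bone Marrow"},
])
```

The input dicts are not modified, so the same data can still be dumped to JSON with `source` intact.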
@jaclyn-taroni: I realized that the `sample_id` column in the TSV file is redundant, because its value is always identical to `refinebio_title`. Can we remove it?
Sure. We’ll find out if our users find that terminology to be confusing.
Closed by PR #629.
Context
Jackiecrunch (#466)
Problem or idea
We are performing some smoothing of the sample metadata and these harmonized values are the ones being displayed in the samples table view. However, we smooth a limited number of fields, so information that may be important to a user, such as the presence or absence of a particular disease manifestation, may not be available in this table. In addition, when we return sample annotations to users in the TSV metadata files, it is essentially JSON in a column. This will be of limited utility to users without considerable programming savvy.
Here's an example of a sample from GEO that has been imported into ArrayExpress:
For this sample, I would like to see the information contained in `variable` and `characteristic` in table form. For example: `cell population` would be the header and the value would be `IFNa DC`. Note that some of the fields in `characteristic` and `variable` are redundant (pointed here), so retaining only one where there would be duplicates would be quite helpful to users.
Here's an example from GEO:
For this example, I am interested in what is contained in the `characteristics` field:
Where, ideally, `tissue`, `disease`, and `serum` would all be column names populated with their respective values. We capture the OA and bone marrow information in our smoothing, but the serum treatment (a key part of the experimental design) is not included.
I know there has been some hesitance to venture into this because the presence of nested values has been mentioned. At the moment, I am not sure why we are able to smooth some of the fields but not extract the others such that they are easily displayed or delivered in a tabular format, particularly if we are able to identify the specific fields (e.g., `characteristic`) that would need to be extracted.
I believe this effort would reduce friction for a large portion of our user base.
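As a minimal illustration of the GEO case, assuming the entries have the `key: value` shape seen in `characteristics_ch1` elsewhere in this thread (the `parse_geo_characteristics` helper is hypothetical, and skipping entries without a colon is an assumption):

```python
def parse_geo_characteristics(characteristics):
    """Turn GEO-style ["tissue: Bone Marrow", ...] strings into columns.

    Each entry is split on its first colon only, so values that contain
    colons are preserved. Entries without a colon are skipped here.
    """
    columns = {}
    for item in characteristics:
        key, separator, value = item.partition(":")
        if not separator:
            continue  # assumption: no "key: value" structure, nothing to extract
        columns[key.strip()] = value.strip()
    return columns


columns = parse_geo_characteristics([
    "tissue: Bone Marrow",
    "disease: OA",
    "serum: Low Serum",
])
# columns["serum"] is "Low Serum"
```

With this shape, `tissue`, `disease`, and `serum` each become a column header with the corresponding value in the sample's row.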
Solution or next step
~Can we run `SRP056295`, an experiment from SRA, and `E-MEXP-31`, ArrayExpress data that has not been imported from GEO, through our system? I do not currently have SRA or ArrayExpress submission metadata on hand, so I can not make a recommendation about what fields to extract yet. Some more examples from ArrayExpress and GEO would probably be helpful, too. These are the ones I happen to have on hand at the moment.~
See comment below about zebrafish experiments.