AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Add and Harmonize SDRF Data #165

Closed Miserlou closed 6 years ago

Miserlou commented 6 years ago

Samples on ArrayExpress have meta-information in sdrf files. We may want to turn some of this information into structured/harmonized fields on our sample table, and to define some harmony with samples from other data sources.

I'm currently crunching the statistics about what these values are. So far, it looks like quite sparse/noisy data.

Miserlou commented 6 years ago

The stats are finally done. Here are the values:

https://gist.githubusercontent.com/Miserlou/479348184735a084107288116d2da8b3/raw/ce9fb186c9f84ba3c270f47b9c1fd82e4274b946/array_express_sdrf_values.txt

It is an extremely long tail. It could maybe be cleaned up further (e.g., FactorValue[ vs Factor Value[ vs FactorValue [ vs Factor Value [, etc.), but I still think that it'd be a very long tail.

Not really sure what to do about this.

I think we could at least give it a shot for all of the various conceptions of "Organism Part".

jaclyn-taroni commented 6 years ago

I will go through the very top of this list and identify fields I think are equivalent & my rationale for inclusion

jaclyn-taroni commented 6 years ago

@Miserlou @dvenprasad -- is the thought that harmonizing these fields somewhat would be most useful for filtering samples (i.e., we're most concerned with the presence or absence of this info for a particular sample/experiment)? That will inform some of my choices.

Miserlou commented 6 years ago

I was imagining both for filtering/searching - in which case perfect harmonization is less important - but also for choosing which information we want to prepare for display.

jaclyn-taroni commented 6 years ago

As far as display is concerned, resolving some of this stuff

ex FactorValue[ vs Factor Value[ vs FactorValue [ vs Factor Value [, etc

Is probably a worthwhile starting point. For displaying a given sample (think table), I think we'll probably want to include all of the information in the sdrf -- perhaps prioritizing "harmonized" categories, i.e., putting them as the first columns

jaclyn-taroni commented 6 years ago

Is there any documentation about the difference between a characteristic and a factor value? In my experience, they can often be redundant.

jaclyn-taroni commented 6 years ago

From this documentation:

These Factor Value[] columns reference the experimental factor names defined in the IDF, and should be placed at the end of the SDRF. The contents of these columns will usually duplicate those in a material Characteristics column or a protocol Parameter Value column.

For filtering purposes, we can likely treat them as the same.

jaclyn-taroni commented 6 years ago

These are best guesses I can make in the absence of a small random sample of how folks use these for their experiments. (That is an approach we can take in the future.) I've generally only gotten to the terms that are included in tens of experiments. We should also revisit information about library strategies, selection, strand, etc. when we take a look at the sequencing repositories.

Note I'm not concerned with Unit or Parameter here and we should consider Factor Value and Characteristic to contain the same information for reasons stated above. And we should normalize this kind of thing as well:

(ex FactorValue[ vs Factor Value[ vs FactorValue [ vs Factor Value [, etc)

I also include terms here that should be used ignoring case, spacing, and often punctuation (specifically, I think we should be ignoring -, _, and .). I've included some rationale and some notes where I have them for a particular category below :) and this is only with filtering in mind.
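The normalization described above (dropping the Factor Value / Characteristics wrapper, ignoring case, and treating -, _, and . as spacing) could be sketched roughly as follows. This is only an illustration, not the code refinebio ended up using: the function name `normalize_sdrf_key` and the exact regular expression are assumptions, and collapsing punctuation to a single space (rather than deleting it) is one possible choice.

```python
import re

# Matches the SDRF column wrapper, e.g. "Factor Value [age]" or "Characteristics[Age]".
# The accepted prefixes are an assumption based on the variants mentioned in this thread.
_WRAPPER = re.compile(
    r"^\s*(?:factor\s*value|characteristics?)\s*\[\s*(.*?)\s*\]\s*$",
    re.IGNORECASE,
)


def normalize_sdrf_key(raw_key):
    """Reduce an SDRF column header to a bare, comparable term (hypothetical helper)."""
    match = _WRAPPER.match(raw_key)
    # Factor Value and Characteristics are treated as the same thing,
    # so only the text inside the brackets is kept.
    term = match.group(1) if match else raw_key
    term = term.lower()
    # Treat -, _ and . as spaces, then collapse runs of whitespace.
    term = re.sub(r"[-_.]", " ", term)
    return re.sub(r"\s+", " ", term).strip()


if __name__ == "__main__":
    for header in ["FactorValue[Age]", "Factor Value [ Tissue-Type ]", "Characteristics[strain_or.line]"]:
        print(header, "->", normalize_sdrf_key(header))
```

With this approach, FactorValue[, Factor Value[, FactorValue [, and Factor Value [ all reduce to the same bare term before any category lookup happens.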

Genetic background

For filtering only (i.e., we're mostly interested in the presence or absence of metadata attached to a sample), I think it's appropriate to lump together the following categories: genotype, genetic background, variation, strain or line, background strain, cultivar and ecotype. This basically would indicate that there is some information about the genetic background (in a broad sense) that is associated with an individual sample. These are not the exact same information (multiple cultivars could potentially have the same genetic background and I bet people use "genotype" and "genetic background" a wide range of different ways depending on their individual experiments) and are good candidates for further refinement at a later date.

First pass would include the following terms (regardless of case, spacing, factor value or characteristic, presence or absence of underscores, etc.): strain/background, strain, strain or line, background strain, genotype, genetic background, genotype/variation, ecotype, cultivar, strain/genotype

Sex or gender

Terms: sex, gender, subject gender

Batch

I think anything that contains the word batch is appropriate... we're probably looking to capture any technical variation that could be considered a batch effect of some kind. Here are some examples: batch, scan date, batch no, hybridization batch, extraction batch, experimental batch number, slide batch, batch id, array batch, batch number, array scan date

Organism part, tissue, or cell type

The term organism part looks like it often encodes tissue or cell type information. So I think, as a first pass at filtering, it's okay to lump them together. In the future, we might try to resolve these in some way that will allow us to filter samples/experiments based on individual tissues or cell types.

Terms: organism part, cell type, tissue, tissue type, tissue source, tissue origin, source tissue, tissue subtype, tissue/cell type, tissue region, tissue compartment, tissues, tissue of origin, tissue-type, tissue harvested, cell/tissue type, tissue subregion, organ

Organism

Term: organism

I believe this should be redundant with other information in the database already, though.

Age

Some age information associated with sample

Terms: age, patient age, age of patient, age (years), age at diagnosis, age at diagnosis years

Disease / disease state / diagnosis

Some information about the presence or absence of a particular disease or diagnosis

Terms: disease, disease state, disease status, diagnosis

Disease stage or grade

Some information about stage or severity of disease

Terms: disease staging, disease stage, grade, tumor grade, who grade, histological grade, tumor grading

Survival

As in "some survival information associated with sample", probably take anything that contains survival -- there are going to be more subtle cases, like os sometimes meaning overall survival, that we should consider handling on a "compendium-specific basis" if that's a direction we head in. For instance, if we have a curated set of public neuroblastoma data that contains tens of experiments, we might consider working with domain experts to harmonize this information (very) manually.

Cell line

Term: cell line

Treatment

Some information about how the sample or patient was treated, without any promise of temporal information (i.e., treatment duration, time course) or concentration/dosage

Terms: treatment, treatment group, treatment protocol, drug treatment, clinical treatment

Race or ethnicity

Terms: race, ethnicity, race/ethnicity

Individual, patient, or subject

If multiple samples come from a single individual, there will likely be information indicating that the sample came from the same individual.

Terms: subject, subject id, subject/sample source id, subject identifier, human subject anonymized id, individual, individual identifier, individual id, patient, patient id, patient identifier, patient number, patient no, donor id, donor

Developmental stage

Terms: developmental stage, development stage, development stages

Candidate for further follow-up: developmental landmark -- if only used in human experiments, probably doesn't mean the same thing

Compound

Terms: compound, compound1, compound2, compound name

Time or time point

Any term that alludes to the fact that this sample/experiment could contain time course information of some kind (but excluding things that are specifically about treatment because it's not entirely obvious that that would be appropriate to harmonize)

Terms: time, initial time point, start time, stop time, time point, sampling time point, sampling time


It would be excellent to cover some more perturbations like RNAi, genetic modification, etc., and information like media, growth conditions, temperature, etc., in the future/with more help + time.
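For filtering, the term lists above amount to a lookup from a normalized term to a category, plus "contains" rules for batch and survival. A minimal sketch, assuming the terms have already been stripped of their Factor Value / Characteristics wrapper; the category labels (`sex`, `organism_part`, ...) and the helper name `categorize` are hypothetical, and only a subset of the terms from this comment is included:

```python
# Subset of the term lists from this comment; category labels are hypothetical.
HARMONIZED_TERMS = {
    "sex": ["sex", "gender", "subject gender"],
    "age": ["age", "patient age", "age of patient", "age (years)", "age at diagnosis"],
    "organism_part": ["organism part", "cell type", "tissue", "tissue type", "organ"],
    "disease": ["disease", "disease state", "disease status", "diagnosis"],
}

# Categories matched by "contains the word", as suggested for batch and survival.
SUBSTRING_RULES = {"batch": "batch", "survival": "survival"}

# Invert the lists once so each lookup is a single dictionary access.
TERM_TO_CATEGORY = {
    term: category for category, terms in HARMONIZED_TERMS.items() for term in terms
}


def categorize(term):
    """Return the harmonized category for a term, or None if it is in the long tail."""
    term = " ".join(term.lower().split())  # ignore case and extra spacing
    if term in TERM_TO_CATEGORY:
        return TERM_TO_CATEGORY[term]
    for needle, category in SUBSTRING_RULES.items():
        if needle in term:
            return category
    return None


if __name__ == "__main__":
    for term in ["Subject Gender", "hybridization batch", "overall survival", "temperature"]:
        print(term, "->", categorize(term))
```

Exact matches are checked before the substring rules so that a curated term always wins over a broad "contains" match.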

jaclyn-taroni commented 6 years ago

Tagging @cgreene to weigh in on these initial choices

cgreene commented 6 years ago

These choices generally look good. drug treatment looks like it might be more appropriate in the Compound bit, but it's hard to know without seeing those values.

jaclyn-taroni commented 6 years ago

How easy/difficult is it to snag a handful of experiments that use the values that we have questions about (e.g., drug treatment, compound, developmental landmark) @Miserlou ?

This would be a good capability to have in the future as we further refine these categories or try other methods of harmonization.

cgreene commented 6 years ago

potentially useful package as we're thinking about this: https://github.com/seatgeek/fuzzywuzzy
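To give a feel for what fuzzy matching would do here, the sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy, so that it runs without extra dependencies; it is not fuzzywuzzy's API. The helper names and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two terms, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(term, candidates, threshold=0.8):
    """Return the closest known term, or None if nothing is similar enough.

    The threshold is an arbitrary starting point and would need tuning
    against real SDRF keys.
    """
    if not candidates:
        return None
    score, candidate = max((similarity(term, c), c) for c in candidates)
    return candidate if score >= threshold else None


if __name__ == "__main__":
    known_terms = ["developmental stage", "disease stage", "cell line"]
    # A misspelled key maps to its nearest known term; an unrelated key maps to None.
    print(best_match("developement stage", known_terms))
    print(best_match("temperature", known_terms))
```

A conservative threshold keeps unrelated long-tail keys from being pulled into a category by accident, at the cost of missing some genuine variants.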

Miserlou commented 6 years ago

pretty.json.zip

Okay, that's the whole AE-wide key/value set with counts. Warning: 50 megabytes of JSON in that ZIP.

jaclyn-taroni commented 6 years ago

The idea of fuzzy string matching and other simple rules as the first pass for harmonizing the sample metadata is one case (and the only one I can think of currently) where we might make use of curated GEO DataSets rather than Series (see GEO overview and related issue #161). Specifically, if the metadata in a DataSet has been curated in order to facilitate the comparison between experimental groups, we could use these as labels to evaluate the performance of our rules. I view this as secondary to and somewhat loosely related to the main GEO surveyor/downloader from a scientific question perspective, which is why I've chosen to include this comment here rather than #161.

Miserlou commented 6 years ago

I have a first pass of these working for all our supported data sources. I encountered the cell type / organism part overlap that you described above, particularly in GEO data, e.g.:

# from GSE32628:`characteristics_ch1`
['patient: P-39',
'gender: female',
'age: 65',
'location: lower leg',
'transplanted organ: kidney',
'immunosuppressive drugs: azathioprine + prednison',
'sample type: squamous cell carcinoma',
'cell type: keratinocyte']

What is the most important thing to emphasize here (leg, kidney, carcinoma, keratinocyte)?

I'd really like to get cell type and organ split into different fields if possible.

(Also, this sample is confusing - does this woman have her kidney in her leg?)
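The `characteristics_ch1` entries above are `key: value` strings, so splitting them into a dictionary is the first step toward separate fields. A minimal sketch, assuming every entry uses the first colon as the separator; `parse_characteristics` is a hypothetical helper, not an existing refinebio function:

```python
def parse_characteristics(entries):
    """Split GEO-style 'key: value' strings into a dict keyed by lowercased key."""
    parsed = {}
    for entry in entries:
        key, separator, value = entry.partition(":")
        if not separator:
            # No colon at all: skip rather than guess what the entry means.
            continue
        parsed[key.strip().lower()] = value.strip()
    return parsed


# Values copied from the GSE32628 example above.
characteristics = [
    "patient: P-39",
    "gender: female",
    "age: 65",
    "location: lower leg",
    "transplanted organ: kidney",
    "immunosuppressive drugs: azathioprine + prednison",
    "sample type: squamous cell carcinoma",
    "cell type: keratinocyte",
]

sample = parse_characteristics(characteristics)

if __name__ == "__main__":
    # Cell type is available directly; which of the remaining keys (if any)
    # should count as the organ or tissue still needs a human decision.
    print(sample.get("cell type"))
    print(sample.get("transplanted organ"))
```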

jaclyn-taroni commented 6 years ago

The most important thing in that example is keratinocyte -- this looks like keratinocyte (epidermis) samples from the lower leg of a patient that received a kidney transplant. But I had to read the experiment to get that. I would think, in general, we want the finest level of detail, which would be the cell type. In the case of keratinocyte, I know that it is from skin because that's where that cell type occurs. However, imagine I have cell type: macrophage and in some samples or experiments they are isolated from peripheral blood and in others they are alveolar macrophages (this information would be the tissue in this case) -- I may want to know both tissue and cell type.

I'd really like to get cell type and organ split into different fields if possible.

As stated above, these were lumped together for filtering purposes. If I were to filter by "has organism part information", lumping them makes sense to me.
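The two goals are compatible: tissue and cell type can be stored as separate fields while the filter only asks whether any of them is present. A rough sketch under that assumption; the key set is a subset of the terms listed earlier, the function name is hypothetical, and the tissue values in the usage example are made up for illustration:

```python
# Subset of the "organism part, tissue, or cell type" terms from this thread.
ORGANISM_PART_KEYS = {"organism part", "tissue", "cell type", "organ"}


def has_organism_part_info(sample_metadata):
    """True if any organism part / tissue / cell type field is present and non-empty."""
    return any(sample_metadata.get(key) for key in ORGANISM_PART_KEYS)


if __name__ == "__main__":
    # Illustrative values only: the same cell type isolated from different tissues.
    blood_sample = {"cell type": "macrophage", "tissue": "peripheral blood"}
    lung_sample = {"cell type": "macrophage", "tissue": "lung"}
    no_info = {"age": "65"}

    for metadata in (blood_sample, lung_sample, no_info):
        print(metadata, "->", has_organism_part_info(metadata))
```

Keeping the fields separate preserves the macrophage distinction described above, while the lumped check still answers "has organism part information" for filtering.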

jaclyn-taroni commented 6 years ago

Closed by #252