Pitfalls: Describe the situations in which others misinterpret or mis-parse your data

jmcmurry commented 6 years ago

[ ] TOPMed
[ ] MODs
[ ] GTEx

Please write up draft descriptions of ways that others misinterpret or mis-parse your data; please do so in a way that empowers the commons team to avoid such issues, or at least to know when to come ask for help.

jmcherry-zz commented 6 years ago

People misparse files when they don’t read the docs. They also misunderstand the data, for example because there are multiple definitions of objects like “gene”.

Are there particular data types you are interested in? We provide lots of data types from the MODs, but just a small number so far from the Alliance.

cmungall commented 6 years ago

GO

using stale versions of the GO (see Wadi et al)
failure to use relationships in the ontology
using relationships inappropriately
misunderstanding evidence types
using inappropriate statistics for enrichment analysis (will post refs later)
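
One common reading of "failure to use relationships in the ontology" is that annotations made to a specific term are not carried up to that term's ancestors. The sketch below shows that propagation on a toy graph. The term IDs, gene names, `PARENTS` table, and function names are all hypothetical, not real GO content, and it assumes every listed edge is safe to propagate over; which relationship types actually are is a question for the GO documentation.

```python
from collections import defaultdict

# Hypothetical toy ontology: child term -> parent terms.
# Assumption: only edges that are safe to propagate over are listed here.
PARENTS = {
    "T:leaf": ["T:mid"],
    "T:mid": ["T:root"],
    "T:other": ["T:root"],
    "T:root": [],
}

def ancestors(term, parents=PARENTS):
    """Return the term plus every term reachable by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def propagate(direct_annotations, parents=PARENTS):
    """Map each term to the genes annotated to it or to any of its descendants."""
    closure = defaultdict(set)
    for gene, terms in direct_annotations.items():
        for term in terms:
            for anc in ancestors(term, parents):
                closure[anc].add(gene)
    return dict(closure)

# Hypothetical direct annotations: gene -> most specific terms.
direct = {"geneA": ["T:leaf"], "geneB": ["T:other"]}
by_term = propagate(direct)
```

Counting only direct annotations would report no genes at all for `T:mid` or `T:root` here, which is the kind of undercount that skews a downstream enrichment analysis.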

jnedzel commented 6 years ago

The GTEx team will respond to this query shortly.

jmcmurry commented 6 years ago

All, just a reminder that the goal here is not just to describe these mistakes but to empower people to avoid repeating them. Therefore, if you could provide links to the specific docs/sites that are "not read" / "not understood", that would be very helpful. Feel free to anonymize specific examples of misuses you've encountered.

@jmcherry I'm not sure if this answers your question above but the main kinds of 'entities' we are interested in are phenotypes, genes and proteins, anatomy, sex, species; and orthogonal entities such as publications. I suspect that ultimately, PPI and pathways will be important too, not to mention, specifics re: variants, alleles, genotypes; however, for 180 days that seems awfully ambitious to me. I could be pleasantly surprised.

bheavner commented 6 years ago

Can you give some more specific guidance about what you’re looking for? And how it differs from what we’ve provided via the TOPMed white paper (linked via https://github.com/orgs/dcppc/teams/everyone/discussions/2)?

For example, some sections of the TOPMed white paper that already describe in detail some of the common misconceptions about TOPMed data include: Section VII on dbGaP data and metadata describes that the TOPMed data is available via dbGaP, not special TOPMed data shares of any kind; Section II on TOPMed projects vs. parents describes the important concept that the studies involved in TOPMed have existed for a long time before TOPMed itself; and Section VIII provides examples of how unharmonized the TOPMed phenotype data is.

jmcmurry commented 6 years ago

Apologies that it took some time to get through the 56 pages; this is helpful, comprehensive documentation. It fits the bill for helping people understand the data, but doesn't directly answer the question of whether, and how exactly, other people may be consistently mis-using your data (with its current formats and documentation).

It is also possible to respond that a) you are not sure, or that b) their parsing errors or false assumptions are lots of different things without any common theme or that c) people are using it correctly but not to its full potential. Concrete examples are especially welcome.

Documenting these pain points before we do the dcppc work can help focus our efforts and demonstrate impact.

bheavner commented 6 years ago

Thanks very much for reading the whitepaper!

As to “how exactly other people may be consistently mis-using your data” - because of the access-controlled nature of the data, TOPMed is structured so that users have to make specific proposals and describe their intended data usage prior to gaining access to the data via dbGaP, so there is lots of opportunity for the community to evaluate the intended data usage and any potential for misuse. Our organizational structure is intended to prevent widespread significant data misuse, and seems to have succeeded at that goal so far.

However, we have seen widespread misunderstanding of what the data is, what the data cleaning process is, and what aspects of the data can be meaningfully analyzed - issues such as people wanting to do statistical analysis on variables that are not comparable (thus “phenotype harmonization”). We have also encountered scaling challenges as the genotype data volume grows with each new data freeze/release. This data set is very rich, but meaningful analysis is very difficult.

A few concrete reasons that meaningful analysis is difficult include:

It requires harmonizing lots of heterogeneous, legacy phenotype data.
- This includes 1) determining which phenotype variables are measuring the same thing across studies as well as 2) processing the data values such that they can be analyzed across studies.
- These data were collected individually by each study and posted to dbGaP, oftentimes long before the beginning of the TOPMed project.
The genotype data size makes computation challenging on local resources (very large matrices mean very large memory requirements and processing times).
The consent value for a given subject can change over time, so the current consent values for all subjects must be tracked.
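
The two harmonization steps described above (deciding which variables measure the same thing, then converting their values to a common scale) can be sketched as a lookup table plus a converter per source variable. This is only an illustration: the study names, variable names, and `VARIABLE_MAP` are hypothetical and do not come from TOPMed or dbGaP.

```python
# Hypothetical mapping: (study, source variable) -> (harmonized concept, converter).
INCH_TO_CM = 2.54

VARIABLE_MAP = {
    ("study_a", "HGT_IN"): ("height_cm", lambda v: v * INCH_TO_CM),
    ("study_b", "height_cm"): ("height_cm", lambda v: v),
}

def harmonize(study, record, variable_map=VARIABLE_MAP):
    """Return harmonized concept -> value; unmapped or missing values are skipped."""
    out = {}
    for name, value in record.items():
        key = (study, name)
        if key not in variable_map or value is None:
            continue
        concept, convert = variable_map[key]
        out[concept] = convert(value)
    return out

# Two studies recording the same concept under different names and units.
a = harmonize("study_a", {"HGT_IN": 70.0, "UNMAPPED": 1})
b = harmonize("study_b", {"height_cm": 177.8})
```

Keeping the mapping as explicit data, rather than burying it in analysis code, makes it reviewable by someone who knows the source studies.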

Here are some potential pitfalls regarding phenotypic data (there are many others that are not listed):

Files provided by dbGaP are in a mostly consistent format, with some quirks:
- Occasionally additional tab delimiters are present in the file.
- Occasionally rows have fewer tab delimiters than expected because the remaining values in the row are all blank.
Sometimes a subject has enrolled in two different studies, and thus has two different phenotype records. Sometimes this information is provided by dbGaP, but often it is not known.
Failure to properly account for missing codes
- Some study variables use integers both as data values (e.g., 1-5) and as an "other" or missing code (e.g., 9). Users fail to notice the missing codes and to exclude those values appropriately. Examples of variables that use integers as both the data value and the missing code:
  - NPBRAAK, phv00162070.v2.p10
  - BP37, phv00104051.v1.p1
SUBJECT_IDs are not uniquely assigned across studies, so one subject in one study could have the same SUBJECT_ID as a different subject in another study. Solution: Include study in the subject identifiers.
Failure to perform appropriate additional data QC
- Phenotype data on dbGaP may need more QC applied. The QC needed is very specific to the phenotype concept.
- Internal inconsistencies in study data:
  - Example: Recorded white blood cell subtype counts do not add up to the recorded total WBC count for some Framingham subjects. Which measurement is correct (if any)?
  - Example: Diastolic blood pressure is higher than systolic blood pressure for some subjects.
- Possible biologically invalid values
  - It is difficult to determine whether an outlier is a recording error or a true value (e.g., due to a loss-of-function variant)
Performing cross-study analyses for binary traits with different case definitions.
- Example: One study used a different cutoff for determining diabetes status than another study. Analyzing these variables together is not appropriate. Solution: need to redefine diabetes status the same way for both studies using blood sugar, other measurements, etc.
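
Several of the pitfalls above (rows with too few or too many tab-delimited fields, integer missing codes, SUBJECT_IDs reused across studies, and diastolic pressure exceeding systolic) can be guarded against at parse time. The following is a minimal sketch, not a dbGaP parser: the column names `SCORE`, `SBP`, and `DBP`, the missing code 9 for `SCORE`, and the function names are hypothetical. Real missing codes have to come from each variable's data dictionary.

```python
import csv
import io

# Hypothetical per-variable missing/other codes; look these up in the
# data dictionary for each real variable.
MISSING_CODES = {"SCORE": {"9"}}

def read_rows(text, study):
    """Parse tab-delimited text into dicts keyed by header name.

    Short rows are padded with blanks; rows with extra fields are collected
    in `problems` for review rather than silently truncated.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    rows, problems = [], []
    for line_no, fields in enumerate(reader, start=2):
        if len(fields) > len(header):
            problems.append((line_no, "extra fields"))
            continue
        fields = fields + [""] * (len(header) - len(fields))
        row = dict(zip(header, fields))
        # Blank or coded-missing values become None.
        for name, value in row.items():
            if value == "" or value in MISSING_CODES.get(name, ()):
                row[name] = None
        # SUBJECT_ID alone is not unique across studies, so key on both.
        row["subject_key"] = (study, row["SUBJECT_ID"])
        rows.append(row)
    return rows, problems

def bp_inconsistent(row):
    """Flag rows where diastolic pressure exceeds systolic pressure."""
    sbp, dbp = row.get("SBP"), row.get("DBP")
    return sbp is not None and dbp is not None and float(dbp) > float(sbp)

# Hypothetical file content exercising each quirk.
sample = (
    "SUBJECT_ID\tSCORE\tSBP\tDBP\n"
    "1\t3\t120\t80\n"
    "2\t9\t70\t110\n"
    "3\t2\n"
    "4\t1\t118\t76\textra\n"
)
rows, problems = read_rows(sample, "study_a")
flagged = [r["subject_key"] for r in rows if bp_inconsistent(r)]
```

Flagged rows are reported rather than dropped, since deciding which measurement is wrong usually needs someone who knows the study.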

Is this kind of information what you're looking for, @jmcmurry ?

jmcmurry commented 6 years ago

Yessssss! This is pitch perfect. Thanks so much @bheavner; you've provided a great template for the other stewards too. As soon as you transfer this text to a markdown file in this repo, we can close the TOPMed portion of this ticket.

jmcmurry commented 6 years ago

Thanks Ben :) I'm going to reopen this just because I'm waiting on it from GTEx and Alliance. I've checked the GTEx box at the top of the issue to show it is 1/3 complete.

jnedzel commented 6 years ago

Here are pitfalls concerning the GTEx data:

Analysis is difficult because:

Raw files are large and require a lot of compute and processing, but raw RNA-seq files must be processed and normalized the same way for any comparison of expression differences to be meaningful.
- A key solution is to be able to readily access and run the same analytical pipelines. Pipeline and tool access are as important, or more so, for our data users.

Data access issues:

Users complain all the time that file and data IDs/names in dbGaP are not the same as the names we gave them and use in the portal, publications, and README files. We get a huge number of complaints and queries about this.
Searching and subdividing the data is difficult: users want to do it by gene, by tissue, or by other variables.

Data interpretation issues/errors:

Comparing expression levels between "different" tissues: users don't realize that this can't be meaningfully done due to differences in cell types and counts. This is also problematic for comparisons with cancers, even if from the same "tissue".
Assuming that a significant eQTL (at p-value 'x') in a specific tissue is biologically meaningful. Complicating issues include LD and cell and tissue sharing.
