Open jmcmurry opened 6 years ago
People misparse files because they don’t read the docs. They also misunderstand the data, for example when there are multiple definitions of objects like “gene”.
Are there particular data types you are interested in? We provide lots of data types from the MODs, but just a small number so far from Alliance.
The GTEx team will respond to this query shortly.
All, just a reminder that the goal here is not just to describe but to empower people to avoid the mistakes that others have made. Therefore, if you could provide links to the specific reference docs/sites that are "not read" / "not understood", that would be very helpful. Feel free to anonymize specific examples of misuse you've encountered.
@jmcherry I'm not sure if this answers your question above but the main kinds of 'entities' we are interested in are phenotypes, genes and proteins, anatomy, sex, species; and orthogonal entities such as publications. I suspect that ultimately, PPI and pathways will be important too, not to mention, specifics re: variants, alleles, genotypes; however, for 180 days that seems awfully ambitious to me. I could be pleasantly surprised.
Can you give some more specific guidance about what you’re looking for? And how it differs from what we’ve provided via the TOPMed white paper (linked via https://github.com/orgs/dcppc/teams/everyone/discussions/2)?
For example, some sections of the TOPMed white paper that already describe in detail some of the common misconceptions about TOPMed data include: Section VII on dbGaP data and metadata describes that the TOPMed data is available via dbGaP, not special TOPMed data shares of any kind; Section II on TOPMed projects vs. parents describes the important concept that the studies involved in TOPMed have existed for a long time before TOPMed itself; and Section VIII provides examples of how unharmonized the TOPMed phenotype data is.
Apologies that it took some time to get through the 56 pages; this is helpful, comprehensive documentation. It fits the bill for helping people understand the data, but doesn't directly answer the question about whether and how exactly other people may be consistently mis-using your data (with its current formats and documentation).
It is also possible to respond that a) you are not sure, or that b) their parsing errors or false assumptions are lots of different things without any common theme or that c) people are using it correctly but not to its full potential. Concrete examples are especially welcome.
Documenting these pain points before we do the dcppc work can help focus our efforts and demonstrate impact.
Thanks very much for reading the whitepaper!
As to “how exactly other people may be consistently mis-using your data” - because of the access-controlled nature of the data, TOPMed is structured so that users have to make specific proposals and describe their intended data usage prior to gaining access to the data via dbGaP, so there is lots of opportunity for the community to evaluate the intended data usage and any potential for misuse. Our organizational structure is intended to prevent widespread significant data misuse, and seems to have succeeded at that goal so far.
However, we have seen widespread misunderstanding of what the data is, what the data cleaning process is, and what aspects of the data can be meaningfully analyzed - issues such as people wanting to do statistical analysis on variables that are not comparable (thus “phenotype harmonization”). We have also encountered scaling challenges as the genotype data volume grows with each new data freeze/release. This data set is very rich, but meaningful analysis is very difficult.
A few concrete reasons that meaningful analysis is difficult include:
It requires harmonizing lots of heterogeneous, legacy phenotype data;
The genotype data size makes computation challenging on local resources (very large matrices mean very large memory requirements and processing times);
The consent value for a given subject can change over time, so the current consent values for all subjects must be tracked.
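To make the consent-tracking point concrete, here is a minimal Python sketch. It assumes a hypothetical consent history of (subject, consent code, effective date) records; the layout, the subject IDs, and the "WITHDRAWN" code are made up for illustration and are not dbGaP's actual format. The idea is only that any analysis should filter on each subject's most recent consent value, not on the value seen when the data was first downloaded.

```python
from datetime import date

# Hypothetical consent history: (subject_id, consent_code, effective_date).
# Layout and codes are illustrative assumptions, not dbGaP's real schema.
consent_history = [
    ("S1", "GRU", date(2016, 3, 1)),
    ("S1", "HMB", date(2018, 7, 15)),
    ("S2", "GRU", date(2017, 1, 10)),
    ("S3", "GRU", date(2016, 5, 5)),
    ("S3", "WITHDRAWN", date(2018, 2, 2)),
]


def current_consent(history):
    """Return {subject_id: consent_code} from each subject's latest record."""
    latest = {}
    for subject, code, effective in history:
        # Keep the record with the most recent effective date per subject.
        if subject not in latest or effective > latest[subject][1]:
            latest[subject] = (code, effective)
    return {subject: code for subject, (code, _) in latest.items()}


def subjects_allowed(history, allowed_codes):
    """List subjects whose current consent value is in allowed_codes."""
    current = current_consent(history)
    return sorted(s for s, code in current.items() if code in allowed_codes)


current = current_consent(consent_history)
allowed = subjects_allowed(consent_history, {"GRU"})
```

In this toy history, S1 and S3 both started under the "GRU" code but would be excluded from a "GRU"-only analysis today, which is exactly the case that is missed when consent is checked only once.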
Here are some potential pitfalls regarding phenotypic data (there are many others that are not listed):
Files provided by dbGaP are in a mostly consistent format, with some quirks:
Sometimes a subject has enrolled in two different studies, and thus has two different phenotype records. Sometimes this information is provided by dbGaP, but often it is not known.
Failure to properly account for missing codes
SUBJECT_IDs are not uniquely assigned across studies, so one subject in one study could have the same SUBJECT_ID as a different subject in another study. Solution: Include study in the subject identifiers.
Failure to perform appropriate additional data QC
Performing cross-study analyses for binary traits with different case definitions.
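Two of the pitfalls above (non-unique SUBJECT_IDs and unhandled missing codes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row layout, the `study` and `bmi` column names, and the missing codes `-9`, `NA`, and the empty string are hypothetical, and real missing codes differ per study and per variable, so they need to be looked up for each dataset.

```python
# Hypothetical rows from two studies. SUBJECT_ID "1001" appears in both,
# but refers to different people. Column names and values are made up.
rows = [
    {"study": "studyA", "SUBJECT_ID": "1001", "bmi": "27.4"},
    {"study": "studyB", "SUBJECT_ID": "1001", "bmi": "-9"},
    {"study": "studyB", "SUBJECT_ID": "1002", "bmi": "31.0"},
]

# Assumed missing-value codes for this one variable (illustrative only).
MISSING_CODES = {"-9", "NA", ""}


def subject_key(row):
    """Include the study in the identifier so IDs cannot collide across studies."""
    return (row["study"], row["SUBJECT_ID"])


def parse_value(raw, missing_codes=MISSING_CODES):
    """Map missing codes to None instead of treating them as measurements."""
    raw = raw.strip()
    return None if raw in missing_codes else float(raw)


def load_bmi(rows):
    """Build {(study, SUBJECT_ID): bmi or None}, rejecting duplicate keys."""
    table = {}
    for row in rows:
        key = subject_key(row)
        if key in table:
            raise ValueError(f"duplicate record for {key}")
        table[key] = parse_value(row["bmi"])
    return table


bmi = load_bmi(rows)
observed = [v for v in bmi.values() if v is not None]
mean_bmi = sum(observed) / len(observed)
```

Keying on SUBJECT_ID alone would silently merge the two "1001" subjects, and parsing `-9` as a number would drag the mean down; both errors produce plausible-looking output, which is why they are easy to miss. Note that this does not address the separate problem of one person enrolled in two studies, which needs information beyond the IDs themselves.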
Is this kind of information what you're looking for, @jmcmurry ?
Yessssss! This is pitch perfect. Thanks so much @bheavner; you've provided a great template for the other stewards too. As soon as you transfer this text to a markdown file in this repo, we can close the topmed portion of this ticket.
Thanks Ben :) I'm going to reopen this just because I'm waiting on it from GTEx and Alliance. I've checked the GTEx box at the top of the issue to show it is 1/3 complete.
Here are pitfalls concerning the GTEx data:
Analysis is difficult because:
Raw files are large and require a lot of compute and processing, and raw RNA-seq files must all be processed and normalized the same way for any comparison of expression differences to be meaningful.
Data access issues:
Users complain all the time that file and data IDs/names in dbGaP are not the same as the names we gave them and use in the portal, publications, and README files. We get a huge number of complaints and queries about this.
searching and subdividing the data is difficult - users want to do it by gene, or by tissue, or by other variables.
Data interpretation issues/errors:
comparison of expression levels between "different" tissues - users don't realize that this can't be meaningfully done due to differences in cell type and counts. This is also problematic for comparisons with cancers, even if from the same "tissue".
assumption that a significant eQTL (at p-value 'x') in a specific tissue is biologically meaningful. Issues include linkage disequilibrium (LD) and cell and tissue sharing.
Please write up draft descriptions of ways that others misinterpret or mis-parse your data; please do so in a way that empowers the commons team to avoid such issues, or at least to know when to come ask for help.