dcppc / data-stewards

Questions and answers about TOPMed, GTEx, and AGR resources.

Pitfalls: Describe the situations in which others misinterpret or mis-parse your data #6

Open jmcmurry opened 6 years ago

jmcmurry commented 6 years ago

Please write up draft descriptions of ways that others misinterpret or mis-parse your data; please do so in a way that empowers the commons team to avoid such issues, or at least to know when to come ask for help.

jmcherry-zz commented 6 years ago

People misparse files because they don’t read the docs. They also misunderstand the data; for example, there are multiple definitions of objects like “gene”.

Are there particular data types you are interested in? We provide lots of data types from the MODs, but just a small number so far from the Alliance.

cmungall commented 6 years ago

GO

jnedzel commented 6 years ago

The GTEx team will respond to this query shortly.

jmcmurry commented 6 years ago

All, just a reminder that the goal here is not just to describe the mistakes others have made, but to empower people to avoid repeating them. Therefore, if you could provide links to the specific reference docs/sites that are "not read" / "not understood", that would be very helpful. Feel free to anonymize specific examples of misuses you've encountered.

@jmcherry I'm not sure if this answers your question above but the main kinds of 'entities' we are interested in are phenotypes, genes and proteins, anatomy, sex, species; and orthogonal entities such as publications. I suspect that ultimately, PPI and pathways will be important too, not to mention, specifics re: variants, alleles, genotypes; however, for 180 days that seems awfully ambitious to me. I could be pleasantly surprised.

bheavner commented 6 years ago

Can you give some more specific guidance about what you’re looking for? And how it differs from what we’ve provided via the TOPMed white paper (linked via https://github.com/orgs/dcppc/teams/everyone/discussions/2)?

For example, some sections of the TOPMed white paper that already describe in detail some of the common misconceptions about TOPMed data include: Section VII on dbGaP data and metadata describes that the TOPMed data is available via dbGaP, not special TOPMed data shares of any kind; Section II on TOPMed projects vs. parents describes the important concept that the studies involved in TOPMed have existed for a long time before TOPMed itself; and Section VIII provides examples of how unharmonized the TOPMed phenotype data is.

jmcmurry commented 6 years ago

Apologies that it took some time to get through the 56 pages; this is helpful, comprehensive documentation. It fits the bill for helping people understand the data, but doesn't directly answer the question of whether, and how exactly, other people may be consistently misusing your data (with its current formats and documentation).

It is also possible to respond that a) you are not sure, or that b) their parsing errors or false assumptions are lots of different things without any common theme or that c) people are using it correctly but not to its full potential. Concrete examples are especially welcome.

Documenting these pain points before we do the dcppc work can help focus our efforts and demonstrate impact.

bheavner commented 6 years ago

Thanks very much for reading the white paper!

As to “how exactly other people may be consistently mis-using your data” - because of the access-controlled nature of the data, TOPMed is structured so that users have to make specific proposals and describe their intended data usage prior to gaining access to the data via dbGaP, so there is lots of opportunity for the community to evaluate the intended data usage and any potential for misuse. Our organizational structure is intended to prevent widespread significant data misuse, and seems to have succeeded at that goal so far.

However, we have seen widespread misunderstanding of what the data is, what the data cleaning process is, and what aspects of the data can be meaningfully analyzed - issues such as people wanting to do statistical analysis on variables that are not comparable (thus “phenotype harmonization”). We have also encountered scaling challenges as the genotype data volume grows with each new data freeze/release. This data set is very rich, but meaningful analysis is very difficult.
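To make the point about non-comparable variables concrete, here is a minimal sketch of the kind of check one might run before pooling a phenotype across studies. It is not TOPMed tooling: the `PhenotypeVariable` type, the helper functions, and the study names, variable names, and units are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class PhenotypeVariable(NamedTuple):
    """Hypothetical record describing one study's version of a phenotype."""
    study: str  # made-up study label
    name: str   # variable name as it appears in that study's files
    unit: str   # unit recorded in that study's data dictionary


def studies_by_unit(variables):
    """Group study labels by the unit each study recorded."""
    groups = defaultdict(list)
    for var in variables:
        groups[var.unit].append(var.study)
    return dict(groups)


def is_poolable(variables):
    """True only when every study recorded the variable in one shared unit."""
    return len(studies_by_unit(variables)) <= 1


# Made-up example: three studies that all measured "height" differently.
heights = [
    PhenotypeVariable("study_a", "height", "cm"),
    PhenotypeVariable("study_b", "HT", "in"),
    PhenotypeVariable("study_c", "height_cm", "cm"),
]

if not is_poolable(heights):
    print("Harmonize before pooling:", studies_by_unit(heights))
```

A unit mismatch is only the simplest case. A check like this would not catch variables that share a unit but were collected in ways that still make them non-comparable, so it should be read as a starting point rather than as a harmonization procedure.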

A few concrete reasons that meaningful analysis is difficult include:

Here are some potential pitfalls regarding phenotypic data (there are many others that are not listed):

Is this kind of information what you're looking for, @jmcmurry ?

jmcmurry commented 6 years ago

Yessssss! This is pitch perfect. Thanks so much @bheavner; you've provided a great template for the other stewards too. As soon as you transfer this text to a markdown file in this repo, we can close the TOPMed portion of this ticket.

jmcmurry commented 6 years ago

Thanks Ben :) I'm going to reopen this just because I'm still waiting on it from GTEx and Alliance. I've checked the TOPMed box at the top of the issue to show it is 1/3 complete.

jnedzel commented 6 years ago

Here are pitfalls concerning the GTEx data:

Analysis is difficult because:

Data access issues:

Data interpretation issues/errors: