Open thackl opened 6 years ago
Hi Thomas,
Good question! Do you have an idea what parts of your genome are missing? How are you getting the % completeness? (e.g. do you know the genome size, do you have some indicator regions that are missing, do you have several sequencing runs from the same genome?)
I've thought about having a "missing" category for the gene presence/absence data, but this has not been implemented yet. From what I understand, though, this would not really help you, because you have no way of knowing whether a gene is just missing from the assembly or truly absent(?)
If there is no way of differentiating missing from absent, I don't think this is something Scoary can fix unfortunately. However, if you have some idea (let's say that it's always the "same" regions that are missing for example) then perhaps a strategy similar to missing data imputation from statistics would help. That is, have the more complete isolates inform the less complete. This would need to take the population structure of your genomes into account as well. Doesn't really sound like a straightforward approach I'm afraid.
If the data is reasonably complete, you could still try running Scoary on your data directly, but your association scores will be a little messed up. In general, your positive associations will be biased downward (making them seem less significant) while your negative associations will be biased upward (making them seem more significant). This is assuming no systematic bias in the missingness. And if you have a small data set I would be very wary about random error.
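A rough way to gauge the size of this effect is to compare the true odds ratio with the one expected after random dropout. The sketch below assumes a deliberately simple model that is not part of Scoary: each truly present gene is detected with the same probability `detection_rate` in every genome, regardless of trait, and absent genes are never falsely detected. The function names and the example prevalences are hypothetical.

```python
def odds_ratio(p_trait: float, p_no_trait: float) -> float:
    """Odds ratio of gene presence between trait-positive and trait-negative genomes."""
    for p in (p_trait, p_no_trait):
        if not 0.0 < p < 1.0:
            raise ValueError("prevalences must lie strictly between 0 and 1")
    return (p_trait / (1.0 - p_trait)) / (p_no_trait / (1.0 - p_no_trait))


def observed_odds_ratio(p_trait: float, p_no_trait: float, detection_rate: float) -> float:
    """Expected odds ratio when present genes are only detected with probability
    detection_rate (assumed identical in both trait groups)."""
    if not 0.0 < detection_rate <= 1.0:
        raise ValueError("detection_rate must lie in (0, 1]")
    # Under this model the observed prevalence is simply the true prevalence
    # scaled by the detection rate.
    return odds_ratio(p_trait * detection_rate, p_no_trait * detection_rate)


# Hypothetical positive association: gene in 80% of trait-positive genomes
# and in 40% of trait-negative genomes.
true_or = odds_ratio(0.8, 0.4)
for rate in (1.0, 0.9, 0.6, 0.3):
    print(f"detection rate {rate:.1f}: expected OR {observed_odds_ratio(0.8, 0.4, rate):.2f} "
          f"(true OR {true_or:.2f})")
```

In this model, lowering `detection_rate` pulls the expected odds ratio of a positive association further below its true value. Real single-cell data will deviate from this wherever completeness varies between genomes or the dropout is not random.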
Cheers, Ola
Hi Ola, thanks a lot for getting back to me on that. My primary data set is a collection of a few complete isolate genomes and mostly incomplete single-cell genomes, all from a single marine cyanobacterium (Prochlorococcus). The samples have been collected from open oceans all around the world. So it is field data, and we don't have multiple sequencing runs for the same genome. Also, my idea was not to look directly at traits, but rather to find associations between gene presence/absence and adaptations to environmental conditions or proxies thereof. Something as simple as genes associated with Atlantic vs. Pacific, for example - I'm not even sure if that is something I should use Scoary for in the first place...
What parts of the genome are missing is for the most part random. It comes down to DNA strand breaks during extraction and how evenly DNA gets amplified during the initial amplification rounds.
For completeness estimates I use CheckM or similar approaches. I.e., I take a set of ~700 core genes which I expect to always be present, and then simply extrapolate the recovery rate of that core set to an overall completeness of the assembly. I also have a fairly good idea about expected genome sizes (2-2.5 Mbp), but that gets more iffy, especially when looking at flexible genes and how they contribute to genome expansion.
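The core-gene extrapolation described above can be sketched in a few lines. This is a simplified illustration and not CheckM's actual algorithm; the function name and the gene identifiers are hypothetical.

```python
def estimate_completeness(found_genes, expected_core_genes):
    """Fraction of the expected core gene set that was recovered in an assembly."""
    expected = set(expected_core_genes)
    if not expected:
        raise ValueError("the expected core gene set must not be empty")
    # Only genes that belong to the core set count towards the recovery rate.
    recovered = expected & set(found_genes)
    return len(recovered) / len(expected)


# Hypothetical example: 700 core genes, 420 of them found in one assembly,
# plus a flexible gene that is ignored by the estimate.
core_set = {f"core_{i}" for i in range(700)}
assembly_genes = {f"core_{i}" for i in range(420)} | {"flexible_1"}

completeness = estimate_completeness(assembly_genes, core_set)
print(f"estimated completeness: {completeness:.0%}")
```

The estimate rests on the assumption that core genes drop out at the same rate as the rest of the genome, which fits the random-loss picture described above but would not hold for systematic gaps.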
I've been thinking about imputation of missing data as well. That would also help with a lot of other downstream analyses for this kind of data, which usually don't work well, or at all, on incomplete genomes. And yes, I agree, it would need to be based on phylogenetic information, too. Unfortunately I haven't found a good way to do that yet.
I might start playing around with some of my more complete genomes, and see what I get. Will let you know if I can get some interesting findings.
Thanks again for taking the time & Happy New Year!
The association of different genes with environmental niches is indeed something Scoary can be used for. If you're looking for just correlations (genes enriched in Atlantic vs. Pacific environments, for example), then I would drop the pairwise comparison measures, which have more to do with causality (such as genes involved in adaptation to a particular environment). I doubt there are enough separate Pacific -> Atlantic adaptation events to gain much insight there anyway.
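Without the pairwise comparisons, a plain enrichment analysis comes down to a 2x2 test per gene. Below is a minimal sketch using a two-sided Fisher's exact test, written from scratch for illustration rather than taken from Scoary's code; the function name and the counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table

                    Atlantic  Pacific
        gene present    a        b
        gene absent     c        d
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x gene-present genomes in column 1.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum over all tables that are at most as likely as the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Hypothetical counts: gene found in 10 of 10 Atlantic and 0 of 10 Pacific genomes.
print(f"p = {fisher_exact_two_sided(10, 0, 0, 10):.2e}")
```

With many genes tested, the resulting p-values would still need a multiple-testing correction, and with incomplete genomes the "gene absent" row mixes true absences with missed genes.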
If the parts that are missing are truly random, then it might be worth giving it a shot!
Would love to hear if you figure out something clever about the data imputation problem! Thanks for giving me a really interesting problem to think about. I'll leave this issue open for a while in case some clever heads happen to pop by.
Happy New Year to you too!
Hi Ola,
I'm working with a large number of single-cell amplified genomes, i.e. the individual assemblies are incomplete, ranging from ~30% to 95% estimated completeness. This means that I do get reliable gene "presences", but "absences" can mean either true absence or just missed in the assembly.
I was wondering what your thoughts on this kind of data would be with respect to association testing. Do you think Scoary could be used or customized to analyze such data?
Cheers, Thomas