Open krshedd opened 1 year ago
I like the idea of removing individuals based on their conScore. However, I think it would be better to have the conScore imported into Loki so people don't have to search for sample summary text files in order to remove contaminated individuals. If the contamination scores can be imported from Loki into R along with the genotypes, we could modify Loki2R to import the scores and produce conScore density distribution plots when type = "GTSNP". Then we could use the remove_ind_con function to remove contaminated individuals. Of course, we'd have to check with Eric to see if this is possible. What do you think?
@awbarclay, I'd considered having conScore imported into LOKI as well; however, it gets a bit tricky since it would be tied to both the fish and the lab project. A fish could have multiple conScore values if it was genotyped on more than one GT-seq project (i.e., re-runs, different locus panels, etc.). This would not be a problem if we were only pulling genotypes by lab project, but it breaks down if you wanted to pull genotypes by a vector of locusnames if they span multiple GT-seq projects. I'm open to suggestions here, but it gets a bit complicated.
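To make the keying problem concrete, here is a minimal base R sketch with made-up data. The table layout, column names, project codes, and locus names are all assumptions for illustration and do not reflect the actual LOKI schema.

```r
# Hypothetical conScore table: one row per fish per GT-seq lab project.
con_scores <- data.frame(
  fish_id  = c("fish_1", "fish_1", "fish_2"),
  project  = c("P001", "P002", "P001"),
  conScore = c(0.05, 0.42, 0.10),
  stringsAsFactors = FALSE
)

# Hypothetical lookup of which lab project each locus was genotyped in.
loci <- data.frame(
  locus   = c("locus_A", "locus_B"),
  project = c("P001", "P002"),
  stringsAsFactors = FALSE
)

# Pulling by lab project: each fish has at most one conScore.
by_project <- con_scores[con_scores$project == "P001", ]

# Pulling by a vector of loci that spans projects: fish_1 now has two scores.
wanted_projects <- unique(loci$project[loci$locus %in% c("locus_A", "locus_B")])
by_loci <- con_scores[con_scores$project %in% wanted_projects, ]
n_scores_fish_1 <- sum(by_loci$fish_id == "fish_1")
print(n_scores_fish_1)
```

Any import of conScore alongside genotypes would need a rule for this case, for example keeping the score per project or taking the maximum per fish; which rule is appropriate is left open here.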
After talking this over with @krshedd, we think it would be great if contaminated fish could be given "0" genotypes before they are imported. That way, the fish will be removed using GCLr::remove_ind_miss_loci(). The lab staff would have to "no call" the fish before importing the genotypes, which will require functions similar to the ones that @krshedd suggested above to determine a threshold and give contaminated fish "0" scores for all loci, to make their lives easier. Lab staff are already "no calling" fish for chip projects, so it wouldn't be much different. @csjalbert is this something that could be implemented in the future?
I like the idea of contaminated fish being no called prior to entering LOKI.
My 2 cents: if a contaminated fish is very likely to already be < 80% successful and will be removed due to that rule, what is the benefit of loading it all up in Loki with 0/0 calls? Additionally, if it is all 0/0 in Loki, the assumption is going to be that it failed and maybe just needs to be rerun, vs. a fish with a crappy success rate that we therefore know is crappy.
@hahoyt there are instances where fish can have a high conScore, but still have a high genotype rate. This plot is an example from K205 - Unuk Chinook 2021 tGMR. The fish in the upper right-hand corner have high contamination and high genotyping success, thus the only way to remove them would be from the GTscore conScore. Think of these as fish that lab staff would have no-called in the past because the VIC/FAM plots were too "fluffy". So far the consensus idea that Andy and I have been mulling would be to add some code to the pipeline post-GTscore, but pre-LOKI import, that would remove these. Definitely want to get @csjalbert's thoughts on this since he knows the pipeline better than all of us.
Oh, I see. So we can't assume a contaminated fish (based on conScore) will have a < 80% success rate. Also, I think that when uSATs are scored and there is a clearly contaminated fish, the team no-calls it for all markers. So this wouldn't be any different than that. Cool.
Exactly, same as how we do SNPs on chips and uSATs, get rid of fish with junk/contaminated genotypes before they go into LOKI.
Not all SNPs on chips are 0/0'd out for a fish with contamination. They will have more 0/0's because of the fluffiness, but the genotypers don't select the fish for all markers and 0/0 it out. Or at least, we never have.
Right, sorry for adding confusion. The point is those 0/0s for SNPs on chips (sounds like a tasty snack?) likely push the fish <80% genotyping success, so they drop out in downstream analyses. That is not the case with GTscore, hence my desire to do something with conScore.
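A small base R sketch of the gap being described, using made-up numbers. The column names and the conScore cutoff are assumptions for illustration only; they are not GTscore's exact output or an agreed threshold.

```r
# Made-up per-fish summary; column names are assumed.
fish <- data.frame(
  fish_id       = c("A", "B", "C", "D"),
  genotype_rate = c(0.98, 0.60, 0.95, 0.97),
  conScore      = c(0.05, 0.55, 0.45, 0.10),
  stringsAsFactors = FALSE
)

min_genotype_rate <- 0.8  # the 80% genotyping success rule
con_cutoff        <- 0.3  # illustrative cutoff only, not a recommendation

passes_80    <- fish$genotype_rate >= min_genotype_rate
contaminated <- fish$conScore > con_cutoff

# Fish that survive the 80% rule but are still flagged by conScore.
slipped_through <- fish$fish_id[passes_80 & contaminated]
print(slipped_through)
```

Fish B fails the 80% rule on its own, but fish C passes it despite a high conScore, which is the case a missing-loci filter alone cannot catch.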
Roger that. :-)
I agree that it makes sense to deal with these contaminated samples. This seems like something that could be implemented in the GTscore pipeline. It could be as simple as a script that runs post-GTscore; that's how the genotype rate plots work. That said, a few questions to make sure I'm understanding correctly:
I'm unclear if the lab (or P/Ls) would select their cut offs and 0 out fish for each project or are you thinking some sort of standardized values that automatically apply (e.g., 0.8 genotype rate and 0.3 contamination) for all projects?
Tying into 2 above, if it involves human review, then it probably wouldn't need to be part of the pipeline, but rather a few extra GCLr functions that someone runs after the data is transferred from the server?
Thanks @csjalbert for the clarifying questions and forgive my lack of a detailed understanding of the order of operations for different pipeline steps.
_singleSNP_sampleSummary.txt files to plot a density distribution of conScore (or a plotly version of heterozygosity vs. conScore like what is already included in _SampleSummaryPlots.pdf) so lab staff could identify a threshold. Then run another function that uses that threshold to remove contaminated fish from the final LOKI input file. Does that make sense? Anything major I'm missing?
@krshedd this makes sense to me. I don't see a way around human review on a project-by-project basis, so it makes sense to set this up on V:, where lab staff have easy access. The only additional comment is that I will not split LOKI files on the server. We can take care of the split on V: after the contaminated fish have been removed. I suppose this would be a 3rd function that we may not even need once we test the new importer.
[x] loki_splits.r - split the filtered LOKI file into 60 MB chunks. Make sure to name it "LOKI_split_x_filtered" or something like that so it's clear this is not the raw LOKI file.
Not sure how to write out CSVs of a certain size in R - any ideas?
Perhaps a quick fix is to split by 500k lines or some other arbitrary amount that is under our file limit.
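One possible approach, sketched in base R under stated assumptions: split by a fixed number of rows, and optionally estimate the rows per file for a target size from a sample of rows. The function names split_by_rows and rows_for_size are hypothetical, the size estimate is approximate, and this is not the code in loki_splits.r.

```r
# Split a data frame into CSV files of at most `max_rows` rows each.
# File names follow the "LOKI_split_x_filtered" suggestion; the rest is assumed.
split_by_rows <- function(df, max_rows, out_dir, prefix = "LOKI_split") {
  chunk_id <- ceiling(seq_len(nrow(df)) / max_rows)
  chunks <- split(df, chunk_id)
  paths <- character(length(chunks))
  for (i in seq_along(chunks)) {
    paths[i] <- file.path(out_dir, paste0(prefix, "_", i, "_filtered.csv"))
    write.csv(chunks[[i]], paths[i], row.names = FALSE)
  }
  paths
}

# Rough estimate of how many rows fit in `target_bytes`, based on the
# on-disk size of a sample of rows (includes the header, so it is approximate).
rows_for_size <- function(df, target_bytes, n_sample = 1000) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  sample_rows <- df[seq_len(min(n_sample, nrow(df))), , drop = FALSE]
  write.csv(sample_rows, tmp, row.names = FALSE)
  bytes_per_row <- file.size(tmp) / nrow(sample_rows)
  max(1, floor(target_bytes / bytes_per_row))
}

# Made-up data standing in for a filtered LOKI file.
demo <- data.frame(
  id = 1:2500,
  genotype = rep(c("A/A", "A/G"), length.out = 2500),
  stringsAsFactors = FALSE
)
out_dir <- tempfile("loki_split_")
dir.create(out_dir)
files <- split_by_rows(demo, max_rows = 1000, out_dir = out_dir)
print(basename(files))
```

Splitting by row count is simpler than targeting an exact byte size; passing the result of rows_for_size() as max_rows gives files near, but not guaranteed under, the target, so leaving some headroom below the importer's limit seems prudent.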
I'll work on these functions soon and let you know what I can come up with.
Just a note that apparently 60 MB files no longer work with our importer. split_gtscore_loki_import.R is a new function that splits LOKI files into usable chunks and it should be used with any contamination score function(s).
We currently do not have any functions to remove contaminated individuals based on GTscore conScore. I propose that we create 2 functions:
find_ind_con - reads in conScore from GTscore singleSNP_sampleSummary.txt file(s), plots density distribution of conScore or heterozygosity vs. conScore similar to GTscore SampleSummaryPlots.pdf output, and outputs modified version of singleSNP_sampleSummary.txt; user inspects plot(s) to determine conScore cutoff.
remove_ind_con - takes the output from find_ind_con in concert with a conScore cutoff to remove individuals above a certain threshold; this threshold may be specific to a given GT-seq panel.
The idea would be for these 2 new functions to become part of our standard QA process along with remove_ind_miss_loci, dupcheck_within_silly, remove_dups, find_alt_species, and remove_alt_species. Previously, using TaqMan, contaminated individuals would likely be no-called for enough SNPs that they'd drop out with remove_ind_miss_loci, but that is not necessarily the case with GT-seq.
Open to other ideas, but @csjalbert and I can work on these when we analyze C015 SEAK coho baseline.
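As a starting point for discussion, here is a rough base R sketch of how the two proposed functions might fit together. It is not the GCLr implementation: the function names carry a _sketch suffix to mark them as hypothetical, and the input is assumed to be a tab-delimited file with columns named sample and conScore, which may not match the real singleSNP_sampleSummary.txt layout.

```r
# Sketch only: NOT the GCLr implementation. Assumes a tab-delimited summary
# file with columns named `sample` and `conScore` (column names are assumed).
find_ind_con_sketch <- function(path, plot = TRUE) {
  smry <- read.delim(path, stringsAsFactors = FALSE)
  stopifnot(all(c("sample", "conScore") %in% names(smry)))
  if (plot && sum(!is.na(smry$conScore)) > 1) {
    # Density of conScore for the user to inspect when choosing a cutoff.
    plot(density(smry$conScore, na.rm = TRUE),
         main = "conScore density", xlab = "conScore")
  }
  smry
}

# Apply a user-chosen cutoff: return the rows kept and the samples removed.
remove_ind_con_sketch <- function(smry, cutoff) {
  flagged <- !is.na(smry$conScore) & smry$conScore > cutoff
  list(kept = smry[!flagged, , drop = FALSE],
       removed = smry$sample[flagged])
}

# Made-up data standing in for a sample summary file.
tmp <- tempfile(fileext = ".txt")
write.table(
  data.frame(sample = paste0("fish_", 1:5),
             conScore = c(0.02, 0.05, 0.41, 0.08, 0.63)),
  tmp, sep = "\t", row.names = FALSE, quote = FALSE
)
smry <- find_ind_con_sketch(tmp, plot = FALSE)
res <- remove_ind_con_sketch(smry, cutoff = 0.3)
print(res$removed)
```

Keeping the cutoff as an explicit argument, rather than a built-in default, leaves room for panel-specific thresholds chosen after inspecting the plot.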