commfish / GCLr

Gene Conservation Lab R package repository

function(s) to identify and remove individuals based on GTscore conScore #30

Open krshedd opened 1 year ago

krshedd commented 1 year ago

We currently do not have any functions to remove contaminated individuals based on GTscore conScore. I propose that we create 2 functions:

The idea would be for these 2 new functions to become part of our standard QA process along with remove_ind_miss_loci, dupcheck_within_silly, remove_dups, find_alt_species, and remove_alt_species. Previously, using TaqMan, contaminated individuals would likely be no-called for enough SNPs that they'd drop out with remove_ind_miss_loci, but that is not necessarily the case with GT-seq.

Open to other ideas, but @csjalbert and I can work on these when we analyze C015 SEAK coho baseline.

awbarclay commented 1 year ago

I like the idea of removing individuals based on their conScore. However, I think it would be better to have the conScore imported into Loki so people don't have to search for sample summary text files in order to remove contaminated individuals. If the contamination scores can be imported from Loki into R along with the genotypes, we could modify Loki2R to import the scores and produce conScore density distribution plots when type = "GTSNP". We could then use the remove_ind_con function to remove contaminated individuals. Of course, we'd have to check with Eric to see if this is possible. What do you think?

krshedd commented 1 year ago

@awbarclay, I'd considered having conScore imported into LOKI as well. However, it gets a bit tricky since it would be tied to both the fish and the lab project. A fish could have multiple conScores if it was genotyped on more than one GT-seq project (e.g., re-runs, different locus panels). This would not be a problem if we were only pulling genotypes by lab project, but it breaks down if you want to pull genotypes by a vector of locusnames that span multiple GT-seq projects. I'm open to suggestions here, but it gets a bit complicated.

awbarclay commented 1 year ago

After talking this over with @krshedd, we think it would be great if contaminated fish could be given "0" genotypes before they are imported. That way, the fish will be removed using GCLr::remove_ind_miss_loci(). The lab staff would have to "no call" the fish before importing the genotypes. To make their lives easier, this will require functions similar to the ones that @krshedd suggested above to determine a threshold and give contaminated fish "0" scores for all loci. Lab staff are already "no calling" fish for chip projects, so it wouldn't be much different. @csjalbert is this something that could be implemented in the future?
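
To illustrate the "no call" idea above, here is a minimal sketch in R. It is only an illustration under stated assumptions: the long-format table, its column names (`sample`, `locus`, `genotype`), and the helper `no_call_samples()` are all hypothetical and do not reflect the actual LOKI import format or any existing GCLr function.

```r
# Toy long-format genotype table. The layout and column names are assumptions
# for illustration, not the actual LOKI import format.
genotypes <- data.frame(
  sample   = rep(c("fish_1", "fish_2"), each = 3),
  locus    = rep(c("locus_A", "locus_B", "locus_C"), times = 2),
  genotype = c("A/A", "A/G", "G/G", "A/G", "A/G", "A/G"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: overwrite every genotype of the listed samples with "0".
no_call_samples <- function(geno_df, samples) {
  geno_df$genotype[geno_df$sample %in% samples] <- "0"
  geno_df
}

# Suppose fish_2 was judged contaminated.
no_called <- no_call_samples(genotypes, samples = "fish_2")

# Proportion of loci with a call, per sample.
call_rate <- tapply(no_called$genotype != "0", no_called$sample, mean)
```

With every locus set to "0", the no-called fish has a call rate of zero, so any filter based on the proportion of missing loci would drop it downstream.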

tylerdann commented 1 year ago

I like the idea of contaminated fish being no called prior to entering LOKI.

In reply to Andy Barclay's comment of Thu, Sep 7, 2023, 11:25 AM, which appears in full above.


hahoyt commented 1 year ago

My 2 cents: if a contaminated fish is very likely to already be < 80% successful and will be removed due to that rule, what is the benefit of loading it all up in Loki with 0/0 calls? Additionally, if it is all 0/0 in Loki, the assumption is going to be that it failed and maybe just needs to be rerun, versus a fish with a crappy success rate, which we therefore know is crappy.

krshedd commented 1 year ago

@hahoyt there are instances where fish can have a high conScore, but still have a high genotype rate. This plot is an example from K205 - Unuk Chinook 2021 tGMR. The fish in the upper right-hand corner have high contamination and high genotyping success, so the only way to remove them is by their GTscore conScore. Think of these as fish that lab staff would have no-called in the past because the VIC/FAM plots were too "fluffy". So far the consensus idea that Andy and I have been mulling is to add some code to the pipeline, post-GTscore but pre-LOKI import, that would remove these. Definitely want to get @csjalbert's thoughts on this since he knows the pipeline better than all of us.

[Image: the K205 - Unuk Chinook 2021 tGMR plot referenced in the comment above]

hahoyt commented 1 year ago

Oh, I see. So we can't assume a contaminated fish (based on conScore) will have a < 80% success rate. Also, I think that when uSATs are scored and there is a clearly contaminated fish, the team no calls it for all markers. So this wouldn't be any different than that. Cool.

krshedd commented 1 year ago

Exactly, same as how we do SNPs on chips and uSATs, get rid of fish with junk/contaminated genotypes before they go into LOKI.

hahoyt commented 1 year ago

Not all SNPs on chips are 0/0'd out for a fish with contamination. They will have more 0/0's because of the fluffiness, but the genotypers don't select the fish for all markers and 0/0 it out. Or at least, we never have.

krshedd commented 1 year ago

Right, sorry for adding confusion. The point is those 0/0s for SNPs on chips (sounds like a tasty snack?) likely push the fish <80% genotyping success, so they drop out in downstream analyses. That is not the case with GTscore, hence my desire to do something with conScore.

hahoyt commented 1 year ago

Roger that. :-)

csjalbert commented 1 year ago

I agree that it makes sense to deal with these contaminated samples. This seems like something that could be implemented in the GTscore pipeline. It could be as simple as a script that runs post-GTscore; that's how the genotype rate plots work. That said, a few questions to make sure I'm understanding correctly:

  1. Would we just change the LOKI file or is the idea to have this act on other outputs?
  2. I'm unclear if the lab (or P/Ls) would select their cut offs and 0 out fish for each project or are you thinking some sort of standardized values that automatically apply (e.g., 0.8 genotype rate and 0.3 contamination) for all projects?
  3. Tying into 2 above, if it involves human review, then it probably wouldn't need to be part of the pipeline, but a few extra GCLr functions that someone runs after the data is transferred from the server..?
    • this seems like what @krshedd described in the first post.

krshedd commented 1 year ago

Thanks @csjalbert for the clarifying questions and forgive my lack of a detailed understanding of the order of operations for different pipeline steps.

  1. Only the final LOKI input file needs to be changed. This could be done once everything is off the server and transferred onto the V: drive. We could rename the LOKI file generated by the pipeline on the server as preliminary or something so it is clear that contaminated fish haven't been removed yet. We could also include a README.txt or something to clarify that all the other GTscore associated files (genepop, rubias, etc.) on the V: drive are raw, and include contaminated fish.
  2. I think it will require some user intervention. My naive thought was to just have an R function that uses the _singleSNP_sampleSummary.txt files to plot a density distribution of conScore (or a plotly version of heterozygosity vs. conScore like what is already included in _SampleSummaryPlots.pdf) so lab staff could identify a threshold. Then run another function that uses that threshold to remove contaminated fish from the final LOKI input file.
  3. Correct, since this requires human intervention, this would all occur on the V: drive, after files have been transferred from the server.

Does that make sense? Anything major I'm missing?
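
The two-step idea above (plot conScore, then apply a threshold) could be sketched roughly as follows. This is a minimal sketch, not an existing GCLr function: the toy table stands in for a _singleSNP_sampleSummary.txt file, and the column names (`sample`, `GenotypeRate`, `conScore`) and both helper names are assumptions. The 0.3 cutoff is only the example value floated earlier in this thread.

```r
# Toy stand-in for a *_singleSNP_sampleSummary.txt table.
# Column names (sample, GenotypeRate, conScore) are assumptions for illustration.
sample_summary <- data.frame(
  sample       = c("fish_1", "fish_2", "fish_3", "fish_4", "fish_5"),
  GenotypeRate = c(0.98, 0.95, 0.60, 0.97, 0.92),
  conScore     = c(0.05, 0.10, 0.45, 0.40, 0.08),
  stringsAsFactors = FALSE
)

# Hypothetical helper: plot the conScore density so a reviewer can pick a cutoff.
# An optional candidate threshold is drawn as a dashed vertical line.
plot_con_score <- function(summary_df, threshold = NULL) {
  dens <- stats::density(summary_df$conScore)
  plot(dens, main = "conScore density", xlab = "conScore")
  if (!is.null(threshold)) abline(v = threshold, lty = 2)
  invisible(dens)
}

# Hypothetical helper: return the samples whose conScore exceeds the cutoff.
find_contaminated <- function(summary_df, threshold) {
  summary_df$sample[summary_df$conScore > threshold]
}

# Example cutoff only; in practice it would be chosen per project after review.
contaminated <- find_contaminated(sample_summary, threshold = 0.3)
```

In this toy data, fish_4 has a high genotype rate but is still flagged, which is the case a genotype-rate filter alone would miss. Calling `plot_con_score(sample_summary, threshold = 0.3)` would draw the density with the candidate cutoff marked.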

csjalbert commented 1 year ago

@krshedd this makes sense to me. I don't see a way around human review on a project-by-project basis, so it makes sense to set this up on V:, where lab staff have easy access. The only additional comment is that I will not split LOKI files on the server. We can take care of the split on V: after the contaminated fish have been removed. I suppose this would be a 3rd function, which we may not even need once we test the new importer.

I'll work on these functions soon and let you know what I can come up with.

csjalbert commented 11 months ago

Just a note that apparently 60 MB files no longer work with our importer. split_gtscore_loki_import.R is a new function that splits LOKI files into usable chunks, and it should be used with any contamination score function(s).
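
For readers unfamiliar with the chunking idea, here is a generic sketch of splitting a table into fixed-size pieces. It is not the implementation of split_gtscore_loki_import.R, whose arguments are not shown in this thread; the helper name `split_into_chunks()`, the row-count limit, and the toy table are all assumptions for illustration.

```r
# Hypothetical helper: split a data frame into chunks of at most max_rows rows.
split_into_chunks <- function(df, max_rows) {
  stopifnot(max_rows >= 1)
  if (nrow(df) == 0) return(list())
  # Rows 1..max_rows get id 1, the next max_rows rows get id 2, and so on.
  chunk_id <- ceiling(seq_len(nrow(df)) / max_rows)
  unname(split(df, chunk_id))
}

# Toy table standing in for a cleaned import file (contaminated fish already removed).
toy <- data.frame(sample = paste0("fish_", 1:7), genotype = "A/G")

chunks <- split_into_chunks(toy, max_rows = 3)
chunk_sizes <- vapply(chunks, nrow, integer(1))
```

Splitting after contaminated fish are removed, as suggested earlier in the thread, means each chunk only contains fish that are meant to be imported.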