brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License
254 stars 35 forks source link

Filtering on nsites #99

Open marcustutert opened 2 years ago

marcustutert commented 2 years ago

Hi Brent,

I was wondering if you are aware/have an easy way to run SOMALIER but only consider the relatedness when n (the shared number of sites) is higher than a given threshold.

This would be super helpful in my use-case where what appears to be happening is I am looking at the relatedness between samples within one cohort (that is stored as a VCF) and another cohort which is a series of CRAM which are the unsequenced version of the VCF samples. What I am noticing is happening is that the relatedness metrics I am getting are extremely overinflated because the number of sites used for comparison in some of these CRAM files is only 2-10 sites and a result I am getting some spurious results.

It would be helpful if SOMALIER would be adapted to either (or ideally both!) run the relatedness and output the relatedness on a filter for the number of shared sites used in this calculation.

Is that some thing that can be done on my/your end?

Marcus

brentp commented 2 years ago

You should be able to do this with awk (or other script) on the pairs.tsv output. Do you mean to do it in the html output? currently that's not supported.

marcustutert commented 2 years ago

Hi Brent -- not quite exactly what I'm after here

Im more wondering at the stage of calculating the relatedness internally (doing all pairwise comparisons) is there a way to exclude from the calculations those pairs which will results in a nsites < given threshold?

brentp commented 2 years ago

I don't understand how that's different from what I suggest. A pair is a pair of samples. Pairs.tsv has all pairs and includes a column for n which is n-sites.

marcustutert commented 2 years ago

Hi Brent,

You are absolutely correct in the end I achieve the same result as what you suggested above. The only reason why this internal change would be helpful is I'm currently just waiting for SOMALIER to finish running through a very large (100k + number of samples) of which the vast majority will have a very small number of sites and will be thrown out at the end and as a result all my time is now spent just waiting for the results file to be read out to my directory. If there was a way to speed this up by considering only pairs that I know I will only be considering in the end anyways that would prove to save a lot of my time here.

If not, that is OK and I will just have to be patient!

Best

marcustutert commented 2 years ago

FYI, I've come up with a somewhat hacky workaround atm which involves simply deleting the .somalier files which correspond to a low number of extracted sites, which I take the number of lines in the file as a proxy for this count. Looking at my outputted relatedness table this seems to get rid of the most problematic CRAM<->non-CRAM sample pairings which works well as a temporary measure.

find . -type f -exec awk -v x=10 'NR==x{exit 1}' {} \; -exec rm -f {} \;

Best, Marcus