brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License
254 stars 35 forks source link

missing pairs - somalier relate v. >= 0.2.7 #76

Closed LolloPero closed 2 years ago

LolloPero commented 3 years ago

I am running somalier -relate and compared the pairs.tsv output across different versions of somalier package.

The input to somalier relate is n=875 .somalier files, then the corresponding pairs.tsv output file should contain the relatedness between all possible pairings n=382.375 (no repetition, order does not count).

Yet, this is true only for some versions, and NOT for the latest (v 0.2.13): somlaier_relate_comparison

brentp commented 3 years ago

Hi, I agree this is confusing, somalier writes a message to stderr that it won't write all sample-pairs to file with large numbers. I changed the cutoff for this to occur in e8a2c291 from 400K to 200K samples so you were between those and therefore see a difference.

There is some randomness and that's why you see mildly different numbers between versions -- you'd also see this if you ran the same version multiple times.

Let me know if this answers your question or if you have suggestions to make it less jarring. I suppose I could see the random number generator with a fixed number so you'd get identical results post v0.2.6.

LolloPero commented 3 years ago

Hi, thank you for clarifying this.

As I understood it, when the number of sample-pairs is above the cutoff, some pairings will be left out for the sake of the html plot.

Is there a way to force somalier relate to write down all pairings to the pairs.tsv outupt file? Or could this be implemented next?

Thanks

brentp commented 3 years ago

Yes, I can probably make this a hidden option. Just to clarify, somalier will only skip writing pairs that are both:

  1. unrelated by genotypes (calculated rel < 0.05)
  2. expected to be unrelated (no relatedness in pedigree file)

all other pairs will be written.

LolloPero commented 2 years ago

Hi Brent,

I would like to follow up on this thread and ask if it would be possible to implement a silent option to output all pairings in the .group.tsv file, regardless of the 2 conditions written above (1. unrelated by genotypes 2. expected to be unrelated).

It would be great if tis could be done in the latest version (0.2.13).

Thanks :)

brentp commented 2 years ago

Hi, this is available in the latest release by setting the environment variable SOMALIER_REPORT_ALL_PAIRS to a non empty value, e.g. export SOMALIER_REPORT_ALL_PAIRS=true