brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Relatedness color pattern issue in plot #126

Open mirgin01 opened 11 months ago

mirgin01 commented 11 months ago

Hello! I'm having some issues understanding how the relatedness is represented in the html plots.

For instance, this pair is colored as related in the html:

sample1 sample2 0.417 0 8368 0.688 4580 4767 7669 1597 5659 5584 3837 12842 115 237 0.5

Whereas this one is marked as unrelated:

sample3 sample4 0.409 0 8367 0.685 4596 4784 7706 1575 5623 5654 3850 12922 126 210 -1.0

Why is the expected_relatedness 0.5 in the first example (as expected), but -1 in the second one? The relatedness value is almost the same, and I can't find any major differences between the two rows. I'm seeing this with many pairs.
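To compare the two rows field by field, a minimal Python sketch can label the values and print the difference. The column names below are an assumption based on the usual layout of somalier's pairs output (17 columns, ending in expected_relatedness); check them against the header line of your own file.

```python
# Column names are ASSUMED from the typical somalier pairs.tsv header;
# verify them against the header line of your own file.
COLUMNS = [
    "sample_a", "sample_b", "relatedness", "ibs0", "ibs2",
    "hom_concordance", "hets_a", "hets_b", "hets_ab", "shared_hets",
    "hom_alts_a", "hom_alts_b", "shared_hom_alts", "n",
    "x_ibs0", "x_ibs2", "expected_relatedness",
]

# The two rows quoted in this comment.
ROWS = [
    "sample1 sample2 0.417 0 8368 0.688 4580 4767 7669 1597 5659 5584 3837 12842 115 237 0.5",
    "sample3 sample4 0.409 0 8367 0.685 4596 4784 7706 1575 5623 5654 3850 12922 126 210 -1.0",
]


def parse_pair(line):
    """Split one whitespace-separated pair row into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    for key in ("relatedness", "expected_relatedness"):
        record[key] = float(record[key])
    return record


pairs = [parse_pair(row) for row in ROWS]
for p in pairs:
    print(p["sample_a"], p["sample_b"], p["relatedness"], p["expected_relatedness"])

# Absolute difference in observed relatedness between the two pairs.
delta = abs(pairs[0]["relatedness"] - pairs[1]["relatedness"])
print(f"difference in relatedness: {delta:.3f}")
```

The observed relatedness values differ by only 0.008, so the two rows really differ in the last field, expected_relatedness, and not in the genotype comparison.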

We are running Somalier with these commands:

somalier extract -d extracted/ --sites sites.hg38.vcf.gz -f GRCh38_full_1000genomes.fa $BAM
somalier relate --infer extracted/*.somalier
somalier relate --ped somalier.samples.tsv extracted/*.somalier

Please, could you clarify why this happens?

Let me know if you need further information. Thank you very much!! Mireia

brentp commented 11 months ago

The infer method has some limitations. You might try running infer again with the pedigree file generated the first time.

Can you also show here the rows for sample1-4 in the samples.tsv file?

mirgin01 commented 11 months ago

Thanks for your quick answer!

These are the samples.tsv:

sample2 sample2 -9 -9 1 -9 male 49.8 17.1 49.8 17.1 0.51 0.40 4549 4767 5584 2484 0.007 24.35 359 180 0 179 19.45 16
sample3 sample3 -9 -9 1 -9 male 37.0 13.3 36.9 13.3 0.51 0.41 4559 4596 5622 2608 0.007 18.17 348 164 0 185 12.35 17
sample4 sample4 -9 -9 1 -9 male 55.6 18.6 55.6 18.6 0.51 0.40 4601 4785 5654 2345 0.006 26.99 353 178 0 176 21.18 17
sample1 sample1 sample2 sample5 1 -9 male 42.6 15.2 42.6 15.2 0.51 0.41 4591 4580 5659 2555 0.006 21.14 359 180 0 178 21.69 16

We've tried running infer again with --ped and the samples.tsv, but the results are the same. When you talk about limitations, do you mean limitations in how the relatedness is calculated, or in how the results are plotted?

brentp commented 11 months ago

Why is sample1 repeated in your samples.tsv? Did you send that sample to the program twice?

I mean there are limitations in how the inference is done, especially with lower-quality samples.

mirgin01 commented 11 months ago

Sorry, that was my mistake! I haven't used sample1 twice; it was a copy/paste error. I've modified the original message.

In my case, I agree with the relatedness score it calculates, but I don't understand why a pair is sometimes colored as related in the plot and sometimes isn't, when the relatedness score is almost the same.

Thanks for looking into this!

brentp commented 11 months ago

If the number of samples is manageable, you can edit the samples file to update the family id and parents where it's obvious to you what's going on.
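For more than a handful of edits, a small script can make the changes repeatable. The sketch below assumes the samples file is tab-separated and that its first four columns are family ID, sample ID, paternal ID and maternal ID, which matches the rows posted earlier in this thread. The function name `set_parents` and the IDs in the example (`fam1`, `child1`, `father1`, `mother1`) are hypothetical.

```python
def set_parents(lines, sample_id, family_id, paternal_id, maternal_id):
    """Return new tab-separated lines with the pedigree columns of one sample updated.

    ASSUMES columns 1-4 are family_id, sample_id, paternal_id, maternal_id.
    Lines starting with '#' (headers) are passed through unchanged.
    """
    updated = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            updated.append(line)
            continue
        fields = line.split("\t")
        if len(fields) >= 4 and fields[1] == sample_id:
            fields[0] = family_id
            fields[2] = paternal_id
            fields[3] = maternal_id
        updated.append("\t".join(fields))
    return updated


# Hypothetical, truncated rows; real files carry many more columns,
# which this function leaves untouched.
rows = [
    "#family_id\tsample_id\tpaternal_id\tmaternal_id\tsex\tphenotype",
    "child1\tchild1\t-9\t-9\t1\t-9",
    "father1\tfather1\t-9\t-9\t1\t-9",
]
result = set_parents(rows, "child1", "fam1", "father1", "mother1")
for row in result:
    print(row)
```

The parents' own rows would need the same family ID, which can be set with further calls that leave their parent columns at -9.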

mirgin01 commented 10 months ago

Hello! I've just found this really useful page, https://github.com/brentp/somalier/wiki/pedigree-inference, which resolved the doubts I was having. I just wanted to ask why you require that parents have a relatedness of < 0.06 to each other. Is that a standard value? This was the threshold that my samples weren't meeting, which caused them to be tagged as unrelated. Thank you very much!
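As a rough illustration, the sketch below checks only the one condition quoted in this comment: two candidate parents must have a relatedness below 0.06 to each other. It is not somalier's implementation. The function name and the example values are hypothetical, and any other criteria on the wiki page are left out.

```python
PARENT_RELATEDNESS_MAX = 0.06  # threshold quoted in the comment above


def parents_pass_threshold(parent_relatedness, cutoff=PARENT_RELATEDNESS_MAX):
    """True if two candidate parents are unrelated enough under the quoted rule."""
    return parent_relatedness < cutoff


# Hypothetical relatedness values between two candidate parents.
for value in (0.01, 0.06, 0.09):
    status = "pass" if parents_pass_threshold(value) else "fail"
    print(f"{value:.2f}: {status}")
```

Because the comparison is strict, a pair of candidate parents at exactly 0.06 would fail this check.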