Gregor-Mendel-Institute / SNPmatch

A simple python library to identify the most likely strain from the population
https://arageno.gmi.oeaw.ac.at/
MIT License
9 stars 5 forks source link

Different results when input is shuffled #13

Closed davetang closed 5 years ago

davetang commented 5 years ago

I have installed SNPmatch using docker-miniconda and created several tests. If you want more information, I can provide all the steps I used to produce the examples below. For now, I will just show the relevant steps.

I created eg3.bed, which is completely identical to sample 1 in sample.vcf and the results are as expected.

rm -rf example/eg3
mkdir example/eg3
cat raw/sample.vcf | grep -v "^#" | head -1000 | cut -f1,2,10 > example/eg3/eg3.bed
snpmatch inbred -v -i example/eg3/eg3.bed -d db.hdf5 -e db.acc.hdf5 -o example/eg3/eg3
cat example/eg3/eg3.scores.txt
1   1000    1000    1.0 1.0 1.0 1000    NA
2   504 1000    0.504   8443.54247132949    8443.54247132949    1000    NA
3   533 1000    0.533   7911.4903138388245  7911.4903138388245  1000    NA
4   504 1000    0.504   8443.54247132949    8443.54247132949    1000    NA
5   491 1000    0.491   8683.14132921334    8683.14132921334    1000    NA
6   500 1000    0.5 8517.193193903857   8517.193193903857   1000    NA
7   483 1000    0.483   8830.922877708486   8830.922877708486   1000    NA
8   497 1000    0.497   8572.47323619864    8572.47323619864    1000    NA
9   508 1000    0.508   8369.95575353433    8369.95575353433    1000    NA
10  500 1000    0.5 8517.193193903857   8517.193193903857   1000    NA

However, when I shuffle the input using shuf (this just randomises the lines) the results are not as expected. The following steps are identical apart from running shuf.

rm -rf example/eg3
mkdir example/eg3
cat raw/sample.vcf | grep -v "^#" | head -1000 | shuf | cut -f1,2,10 > example/eg3/eg3.bed
snpmatch inbred -v -i example/eg3/eg3.bed -d db.hdf5 -e db.acc.hdf5 -o example/eg3/eg3
cat example/eg3/eg3.scores.txt
1   496 1000    0.496   8590.907917160912   1.0637249543580847  1000    NA
2   504 1000    0.504   8443.54247132949    1.045478186536507   1000    NA
3   499 1000    0.499   8535.61587463412    1.056878701787307   1000    NA
4   524 1000    0.524   8076.249299185785   1.0 1000    NA
5   501 1000    0.501   8498.774513176264   1.0523170098319123  1000    NA
6   508 1000    0.508   8369.95575353433    1.0363666899656199  1000    NA
7   521 1000    0.521   8131.2411580875205  1.006809083878488   1000    NA
8   469 1000    0.469   9090.157529759197   1.1255419679374719  1000    NA
9   522 1000    0.522   8112.906530450911   1.004538892982021   1000    NA
10  514 1000    0.514   8259.695714936273   1.0227143082085124  1000    NA

If I print every second line (thus the input is still sorted) instead of shuffling, I also get unexpected results.

rm -rf example/eg3
mkdir example/eg3
cat raw/sample.vcf | grep -v "^#" | perl -nle "if ($. % 2 == 0){ print }" | cut -f1,2,10 > example/eg3/eg3.bed
snpmatch inbred -v -i example/eg3/eg3.bed -d db.hdf5 -e db.acc.hdf5 -o example/eg3/eg3
cat example/eg3/eg3.scores.txt
1   514 1000    0.514   8259.695714936273   1.0368199969939522  1020    NA
2   490 1000    0.49    8701.600014528602   1.0922911947702092  1020    NA
3   501 1000    0.501   8498.774513176264   1.066830991033872   1020    NA
4   530 1000    0.53    7966.373853594235   1.0 1020    NA
5   487 1000    0.487   8757.000081471551   1.099245433670001   1020    NA
6   484 1000    0.484   8812.436172983842   1.1062041946484704  1020    NA
7   492 1000    0.492   8664.686645197171   1.0876575471395777  1020    NA
8   512 1000    0.512   8296.433052811099   1.041431547813683   1020    NA
9   474 1000    0.474   8997.483502817286   1.1294327467142204  1020    NA
10  476 1000    0.476   8960.441974174311   1.1247830115494237  1020    NA

Even more perplexing is when I shuffle and re-sort the input, I get another result!

rm -rf example/eg3
mkdir example/eg3
cat raw/sample.vcf | grep -v "^#" | head -1000 | shuf | sort -k2,2n | cut -f1,2,10 > example/eg3/eg3.bed
snpmatch inbred -v -i example/eg3/eg3.bed -d db.hdf5 -e db.acc.hdf5 -o example/eg3/eg3
cat example/eg3/eg3.scores.txt
1   992 1000    0.992   100.7710316093164   1.0 1000    NA
2   506 1000    0.506   8406.741111258392   83.42418428195569   1000    NA
3   529 1000    0.529   7984.676397077538   79.23583067040217   1000    NA
4   504 1000    0.504   8443.54247132949    83.78938209211381   1000    NA
5   493 1000    0.493   8646.235962207935   85.80080826927428   1000    NA
6   502 1000    0.502   8480.359832467335   84.15473868864628   1000    NA
7   487 1000    0.487   8757.000081471551   86.89997454250489   1000    NA
8   495 1000    0.495   8609.346598381862   85.43473715501706   1000    NA
9   506 1000    0.506   8406.741111258392   83.42418428195569   1000    NA
10  500 1000    0.5 8517.193193903857   84.52025406393113   1000    NA

Do you know what's going on?

davetang commented 5 years ago

Here's my R Markdown file for how I installed SNPmatch and created the examples.

rbpisupati commented 5 years ago

Hi @davetang ,

SNPmatch does all the calculations using numpy arrays in database and sample files and assumes the positions in VCF or BED as sorted arrays. You need to provide in sorted input file (and also for positions in database file).

davetang commented 5 years ago

The last two examples show sorted input but the result is not as expected. I am expecting 1000 out of 1000 matches for sample 1.

rbpisupati commented 5 years ago

Thank you for making a detailed markdown file, makes life easier. I think the problem is duplicated entries in sample.vcf. You have multiple lines at SNP positions 5616, 6398, 10829 etc. Can you remove them and try running SNPmatch again?

davetang commented 5 years ago

Well spotted; that was the problem. Thank you!