Position inconsistency - Githubissues

azzaea commented 4 years ago

Hello,

I'm now looking at the truffle.segments output file and trying to retrieve the actual markers IDs that border those segments. To do so, I'm looking at the VARSTART and POS fields in this file. My understanding from the documentation, :

VARSTART : the index of marker on the start of the segment. This corresponds to the n’th marker in the VCF file
POS : Position of the start of the segment in MBpairs.

Is: if VARSTART lists some variant i, then (assuming no filtering), this should mean the segment starts at the variant in line i of the vcf file (ignoring the vcf header, of course). Therefore, I expect the entry for POS that truffle reports to actually be POS[line i] in this vcf file (in Mbp).

Looking closely at the truffle.segments output and the original vcf file, the POS entries in those files do not coincide (there is some non-deterministic shift between the two).

Below, I'm reporting one such case, where I'm uploading the files of segments from chromosome 1 to Dropbox (these are from 1KGP data, so are in a sense public as is).

In the segments file, the first reported IBD segment has VARSTART=2659 and POS=165.0135 Mbp
The vcf file has a 7-lines header, so the corresponding variant should be in line [2659+(7)=2666]. This corresponds to the variant with ID=rs2136241 at POS=163289571. (so the POS fields do not coincide).
Strangely though, in the vcf file above, there is a variant in POS=165013535, with ID=rs7524769. This variant is in line 2708, meaning it has index [2708-(7)=2701].
As you see, there is a mismatch between those indexes
This phenomenon of seemingly unpredictable indexes/positions repeats across the segments, where the index mismatch could be as low as 1, or have some other value.

Maybe I'm missing something? Would you, therefore, help demystify the mechanism Truffle uses to calculate entries for POS?

As always, thank you in advance for your help.

Best, Azza

adimitromanolakis commented 4 years ago

Hi Azza,

This might be related to the filtering and one bug currently existing in the filtering options.

As it stands now , the positions will not match the vcf file if there is any filtering applied. We plan to fix this issue in the next release of truffle ( ~ October ).

If you are using the option --nofiltering, there is a bug currently which prevents it from disabling all filtering.

Can you try using the options : --maf 0 --missing 0 ? If this does not fix the issue, I will investigate further. Also please send me the command line options you are using for truffle?

Apostolos

azzaea commented 4 years ago

Hello Apostolos,

Thank you for getting back on this. I used truffle with default arguments (no filtering), and my command was:

truffle --vcf ${inputvcf} --cpu 12 --segments

Admittedly, the input is a single chromosome dataset, so I should have fiddled with the -L parameter, but I honestly couldn't tell how results are better when changing it so didn't really try to edit.

With the new flags, almost all the IBD segments are gone! The segments file is literally a list of just 8 segments (I'm using the same vcf file uploaded to this ticket before). The position glitch still manifests. The command and result are below:

$ truffle --vcf ${inputvcf} --cpu 12 --segments --maf 0 --missing 0
$ cat tuffle.segments
TYPE              ID1             ID2  CHROM  VARSTART   VAREND              POS Mbp      LENGTH Mbp    NMARKERS
IBD1     1340_NA07056    1341_NA06993      1       421      668         166.3053 Mbp     69.8937 Mbp         248
IBD1     1447_NA12761   13292_NA07031      1       498      730         199.2402 Mbp     47.5986 Mbp         233
IBD1     2418_NA19818    ASW4_NA20278      1        41      280          15.1029 Mbp     73.5309 Mbp         240
IBD1   LWK003_NA19432 NA19360_NA19360      1       386      698         156.3691 Mbp     84.5343 Mbp         313
IBD1  NA21455_NA21455 NA21494_NA21494      1       472      702         186.1548 Mbp     55.7265 Mbp         231
IBD1  NA21487_NA21487 NA21581_NA21581      1       324      576         108.0882 Mbp    107.0404 Mbp         253
IBD1     PR46_HG01197    PR49_HG01205      1       254      522          79.3056 Mbp    123.3759 Mbp         269
IBD1  NA19390_NA19390    Y092_NA19257      1         5      236           3.2725 Mbp     67.0631 Mbp         232

I understand your plan and the needed work, but is it safe to take the variant order from the first column and using it as index in the original vcf? I can live with this.

Thank you in advance for your help!

Best, Azza

adimitromanolakis commented 4 years ago

Hi Azza,

I made a mistake in my previous email: the correct arguments to disable the filtering is:

--maf 0 --missing 1

with those arguments, filtering is disabled and you can take the variant order as an index to the vcf file. Otherwise the variant position will not match exactly, because of the filtering. We will make some improvements so this is correct in the next release.

If you still have some problems with the above options, let me know.

Best, Apostolos

azzaea commented 4 years ago

Hi Apostolos,

Fiar enough, makes sense. I think this setup works (there are more IBD segments, so filtering seems off; and now, POS in the segments file corresponds with the POS in the vcf file of variant[VARSTART+vcf header length + 1] ). This is adequate enough, thank you.

Since we are here though, could you elaborate a bit on how to tweak parameters when working with single chromosome vcfs? I mean, the documentation advises to fiddle with -L, --ibs1markers and --ibs2markers, but not sure what we are trying to optimize here (or if there are recommended values to use for these knobs).

Thanks again.

Best, Azza

adimitromanolakis commented 4 years ago

Hi Azza,

For the single chromosome case, the most flexible way is to use --ibs1/2markers. For ibs1 segments you would at least 500-600 markers to establish a low false positive rate. If you are not interested in small segments, I would start by setting the limit to be equivalent to ~5MB. You can do that by computing the density of the markers in your chromosome file and figuring out how much the segment length in terms of number of markers should be. For example:

Length of chromosome: 200 Mb Number of markers: 100000 Density: 200e6/100e3 = 1 marker every 2000 bp Number of markers in a 5MB segment: 5e6 / 2000 = 2500

So I would start with ibs1markers to be 2500. This all depends on the allele frequencies and the LD between the markers. Ideally for truffle, you wouldn't want regions that have vastly higher or lower marker density and probably it's best to filter out variants with very low maf (<1%).

For ibs2markers, you can set a limit that is lower 2-3 times.

Ideally, if you want to be exact and find all 5MB segments, the strategy is to set a lower limit and do post-filtering for the segments based on physical or genetic map length.

We will also work on adding some easy way to specify the parameters automatically for single chromosome files.

Best, Apostolos

adimitromanolakis / truffle

Position inconsistency #3