gaow / SEQLinkage

Collapsed Haplotype Pattern Method for Linkage Analysis of Next-Generation Sequencing Data
MIT License
7 stars 6 forks source link

SEQLinkage output run in MERLIN results in "Requires impossible recombination pattern" Error #19

Open oleraj opened 4 years ago

oleraj commented 4 years ago

Hi,

I'm running SEQLinkage on 6 families with an autosomal dominant trait we have previously mapped. I'm using this set of families to test SEQLinkage to see if we can recapitulate the results. I have no issues running SEQLinkage:

seqlink --fam ${base}.ped --vcf ${base}.recode.ex.16.vcf.gz --freq ExAC_AF -f MERLIN --blueprint ../SEQLinkage-master/data/genemap.hg19.txt
...
MESSAGE: Checking local resources 5/5 ...                                                                
MESSAGE: 31 samples found in ...
MESSAGE: 9 samples found in FAM file but not in VCF file:
Gma10
Gma17
Dad9
Dad18
Gpa9
mGma22
Dad10
Mom11
pGma22
MESSAGE: 6 families with a total of 31 samples will be scanned for 28,488 pre-defined units
MESSAGE: 573 units (from 2,271 variants) processed; 0 Mendelian inconsistencies and 424 recombination events handled
MESSAGE: 27,907 units ignored due to absence in VCF file
MESSAGE: 8 units ignored due to absence of variation in samples
MESSAGE: Archiving regional marker data to directory 
MESSAGE: 573 units will be converted to MERLIN format
MESSAGE: 573 units successfully converted to MERLIN format
MESSAGE: Archiving MERLIN format to directory ...
MESSAGE: Saving data to ...

However when I run the results in MERLIN all of the families are skipped:

i=16; merlin -d LINKAGE/MERLIN/LINKAGE.chr${i}.dat -p LINKAGE/MERLIN/LINKAGE.chr${i}.ped -m LINKAGE/MERLIN/LINKAGE.chr${i}.map --model model | awk '$0 ~ /Para|HLOD/ || $4 > 1.0 {print}'; 
References for this version of Merlin:
The following parameters are in effect:
                     Data File : LINKAGE/MERLIN/LINKAGE.chr16.dat (-dname)
                 Pedigree File : LINKAGE/MERLIN/LINKAGE.chr16.ped (-pname)
            Missing Value Code :         -99.999 (-xname)
                      Map File : LINKAGE/MERLIN/LINKAGE.chr16.map (-mname)
            Allele Frequencies : ALL INDIVIDUALS (-f[a|e|f|m|file])
                   Random Seed :          123456 (-r9999)
          Limits : --bits [24], --megabytes, --minutes
Estimating allele frequencies... [using all genotypes]
Done estimating frequencies for 703 markers
Analysis models will be retrieved from file [model]...
Table processed. 1 models recognized
Family: Family-22 - Founders: 5  - Descendants: 7  - Bits: 10      
  SKIPPED: Requires impossible recombination pattern
Family: Family-868 - Founders: 3  - Descendants: 4  - Bits: 5 
  SKIPPED: Requires impossible recombination pattern
Family: Family-2435 - Founders: 3  - Descendants: 3  - Bits: 3 
  SKIPPED: Requires impossible recombination pattern
Family: Family-2439 - Founders: 3  - Descendants: 3  - Bits: 3 
  SKIPPED: Requires impossible recombination pattern
Family: Family-2443 - Founders: 3  - Descendants: 2  - Bits: 2 
  SKIPPED: Requires impossible recombination pattern
Family: Family-2452 - Founders: 2  - Descendants: 2  - Bits: 2 
  SKIPPED: Requires impossible recombination pattern
Parametric Analysis, Model Rare_Dominant
       POSITION        LOD      ALPHA       HLOD

The documentation for MERLIN says this impossible recombination pattern error is often a result of multiple markers having the same position, which is indeed the case in the SEQLinkage output. For example:

grep PTX4 LINKAGE/MERLIN/LINKAGE.chr16.map
16  PTX4[1] 5.180964049091157   9.445039156840059   1.1235577008138633
16  PTX4[2] 5.180964049091157   9.445039156840059   1.1235577008138633

It appears that some of the markers are being split into two separate bins, but the position is the same for both, which is apparently a problem for MERLIN. It seems that SEQLinkage should construct a new position for each split marker based to avoid issues with MERLIN. I'm using the default option for --bin, which uses LD to bin the variants. Is there any way to use this LD binning method that will produce output suitable for MERLIN, i.e., different position for each marker?

Thanks!

Andrew

oleraj commented 4 years ago

I also tried using --bin 0 option and that also splits the markers so it appears the --bin option is not being recognized or I'm misunderstanding how to use it. --bin 1 produces identical results as well.

grep PTX4 LINKAGE/MERLIN/LINKAGE.chr16.map
16  PTX4[1] 5.180964049091157   9.445039156840059   1.1235577008138633
16  PTX4[2] 5.180964049091157   9.445039156840059   1.1235577008138633

I would like to have optimized binning into haplotypes with a separate position for each haplotype. Any suggestions on how to do that would be appreciated.

gaow commented 4 years ago

Thanks @oleraj for the report. This issue is related to recombination events which we most recently discussed in a related project:

https://github.com/statgenetics/rvnpl/issues/2

We should make some updates this week on that issue, depending on @percylinhai 's schedule. I suppose we can try fix the genetic map distance along the line. I'll discuss with @percylinhai and make adjustments. Options --bin happens after these sub-regional markers are generated.

oleraj commented 4 years ago

Great, thanks for the reply and glad to hear you're planning some updates on this.

gaow commented 4 years ago

@oleraj sorry we've done this a while ago but forgot to update you here. @percylinhai has implemented a fix that should give different genetic distance for subregions, based on their physical positions. Please try pulling our docker images again and test.