PalamaraLab / ASMC

Ascertained Sequentially Markovian Coalescent
GNU General Public License v3.0
12 stars 1 forks source link

Genetic map file format unclear #5

Open stephenturner opened 2 years ago

stephenturner commented 2 years ago

Documentation on the fastsmc map file (https://github.com/PalamaraLab/ASMC/blob/main/docs/fastsmc.md#genetic-map-map) are unclear:

The genetic map file needs to provide physical positions (in base pairs) and genetic positions (in centimorgans). FastSMC expects a tab separated file containing at least 3 columns, with physical positions in the first column and genetic positions in the third column. The second column may contain any values. The file may be optionally compressed using gzip.

Which column should be the chromosome column?

This description differs from the ASMC genetic map file documentation (https://github.com/PalamaraLab/ASMC/blob/main/docs/asmc.md#genetic-map-mapmapgz):

The genetic map provided in input to ASMC is in Plink map format, in which each line has four columns with format "Chromosome SNPName GeneticPosition PhysicalPosition". Genetic positions are in centimorgans, physical positions are in bp. The map can be optionally compressed using gzip.

Thanks for any help 🙏

fcooper8472 commented 2 years ago

Hi @stephenturner, thanks for your question.

As you have identified, the map files used by ASMC and FastSMC are different. The genetic map file for FastSMC needs to be in "HapMap" format (not to be confused with the Plink format, which is used for ASMC).

This "HapMap" format should provide physical positions (in base pairs) and genetic positions (in centimorgans). Example maps can be found here and here. This map format does not have a column for chromosome.

For historical reasons the physical positions and genetic positions are in columns 1 and 3, and for FastSMC the contents of column 2 are not used and therefore any data can exist in that column.

Please let me know if that answers your question.

stephenturner commented 2 years ago

Thanks for the quick response @fcooper8472! I'm still a bit confused, even after looking at the example.

I have a phased VCF with autosomal SNPs on chr 1-22 that I've converted to .hap.gz+.sample using bcftools convert --hapsample. Trying to run FastSMC on this data.

I'm not clear on how FastSMC will do this if there is no specification of which chromosome the physical and genetic positions correspond to. I'm looking at the two examples you pointed me to.

The first, from odelaneau/shapeit4, has separate files for each chromosome, and each individual file has chromosome in column 2:

wget https://github.com/odelaneau/shapeit4/raw/master/maps/genetic_maps.b37.tar.gz
tar -xzf genetic_maps.b37.tar.gz
ls
# chr10.b37.gmap.gz
# chr11.b37.gmap.gz
# chr12.b37.gmap.gz
# chr13.b37.gmap.gz
# chr14.b37.gmap.gz
# chr15.b37.gmap.gz
# chr16.b37.gmap.gz
# chr17.b37.gmap.gz
# chr18.b37.gmap.gz
# chr19.b37.gmap.gz
# chr1.b37.gmap.gz
# chr20.b37.gmap.gz
# chr21.b37.gmap.gz
# chr22.b37.gmap.gz
# chr2.b37.gmap.gz
# chr3.b37.gmap.gz
# chr4.b37.gmap.gz
# chr5.b37.gmap.gz
# chr6.b37.gmap.gz
# chr7.b37.gmap.gz
# chr8.b37.gmap.gz
# chr9.b37.gmap.gz
# chrX.b37.gmap.gz
# chrX_par1.b37.gmap.gz
# chrX_par2.b37.gmap.gz
# genetic_maps.b37.tar.gz

gzip -dc chr1.b37.gmap.gz | head
## pos  chr     cM
## 55550        1       0.000000
## 82571        1       0.080572
## 88169        1       0.092229
## 254996       1       0.439456
## 564598       1       1.478148
## 564621       1       1.478214
## 565433       1       1.480558
## 568322       1       1.488889
## 568527       1       1.489481

The second example from the fastsmc example shows a slightly different format. Four columns, with physical position in the first, and genetic position in the third. Presumably the second and fourth columns are ignored.

wget https://github.com/PalamaraLab/ASMC_data/raw/main/examples/fastsmc/example.map
head example.map
# 1     0.5043954585    0       1.65e-8
# 2498  0.4706272937    0.00117516      1.65e-8
# 2679  0.4287316372    0.00125276      1.65e-8
# 2710  0.3838002191    0.00126465      1.65e-8
# 3118  0.3410386606    0.0014038       1.65e-8
# 3944  0.3123232067    0.00166178      1.65e-8
# 5681  0.3104991665    0.00220111      1.65e-8
# 5863  0.3094242932    0.00225743      1.65e-8
# 8250  0.3145128595    0.00300817      1.65e-8
# 8501  0.3351424147    0.00309229      1.65e-8

However, is this just the map position for chromosome 1 only? When I look at the example.hap.gz, it looks like there are only variants on chr1:

wget https://github.com/PalamaraLab/ASMC_data/raw/main/examples/fastsmc/example.hap.gz
zcat example.hap.gz | cut -d' ' -f1-3 | tail
# 1:29969491_1_2 SNP_29969491_98302 29969491
# 1:29969681_1_2 SNP_29969681_98303 29969681
# 1:29972852_1_2 SNP_29972852_98308 29972852
# 1:29973166_1_2 SNP_29973166_98315 29973166
# 1:29981007_1_2 SNP_29981007_97289 29981007
# 1:29986798_1_2 SNP_29986798_98319 29986798
# 1:29988624_1_2 SNP_29988624_62119 29988624
# 1:29990616_1_2 SNP_29990616_98335 29990616
# 1:29999062_1_2 SNP_29999062_98322 29999062
# 1:29999458_1_2 SNP_29999458_98331 29999458

My first thought was that FastSMC requires analyses to be run separately on each chromosome, but looking at the output file format documentation, it notes that column 6 is chromosome number, which makes me think that analysis could be done on all chromosomes at once. But if that's true it's unclear to me how the map file will specify which chromosome corresponds to rows with physical/genetic positions.