FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0

Nmasked human genome #55

Closed fanghe0720 closed 2 years ago

fanghe0720 commented 2 years ago

Update: I think the problem was the format of my SNP files. I took a SNP file from the mouse genome as a template and replaced its content, and everything works now. Sorry for the inconvenience, and thank you very much!

Hi,

I'm trying to prepare an N-masked human genome with SNPsplit. I read issue #22, which covers a similar topic, and decided to follow your suggestion to use the --skip_filtering option. I used the strain name 'SPRET_EiJ' for convenience and put my SNP files in a folder named 'SNPs_SPRET_EiJ'. However, it still fails: 0 positions are changed to N on every chromosome. Could you help me find out where the problem is?

My command is:

../SNPsplit-0.3.4/SNPsplit_genome_preparation --nmasking --skip_filtering --vcf_file sample.vcf.gz --reference_genome ../hg38_genome/ --strain SPRET_EiJ --genome_build hg38_Nmasked

My VCF file was downloaded from https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-common_all.vcf.gz

My reference genome was downloaded from http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/GRCh38.primary_assembly.genome.fa.gz. I removed every 'chr' prefix from the chromosome names (replaced 'chr' with '').
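For reference, the prefix removal can be sketched roughly as below. This is only an illustration, not the exact command I ran; the function names `strip_chr_prefix` and `rename_fasta` are made up for this example. It touches header lines only and leaves sequence lines alone.

```python
def strip_chr_prefix(line: str) -> str:
    """Remove a leading 'chr' from a FASTA header line; other lines pass through."""
    if line.startswith(">chr"):
        # Keep the '>' marker and drop the three letters 'chr' that follow it.
        return ">" + line[4:]
    return line


def rename_fasta(lines):
    """Yield FASTA lines with 'chr' stripped from header names only."""
    for line in lines:
        yield strip_chr_prefix(line)


if __name__ == "__main__":
    example = [">chr1 1\n", "ACGTNNNN\n", ">chrM\n", "GATCACAG\n"]
    print("".join(rename_fasta(example)), end="")
```

Note that a plain prefix removal turns chrM into M, which matches the 'chr M' entry in the log below, whereas the hardcoded chromosome list in the same log uses MT.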

My SNP files are in the format below:

rs544419019 1 11012 C/G
rs561109771 1 11063 T/G
rs540538026 1 13110 G/A
rs62635286 1 13116 T/G
rs62028691 1 13118 A/G
rs531730856 1 13273 G/C
rs548333521 1 13284 G/A
rs571093408 1 13380 C/G
rs568927457 1 13453 T/C
rs546169444 1 14464 A/T
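In light of the update at the top of this issue, a quick format check on the SNP files would have caught the problem early. The sketch below assumes the layout given in the SNPsplit documentation: five tab-separated fields (SNP-ID, Chromosome, Position, Strand, Ref/SNP). That layout is an assumption here and should be checked against the SNPsplit version in use. The function name `check_snp_line` is hypothetical and not part of SNPsplit.

```python
VALID_BASES = set("ACGT")


def check_snp_line(line: str) -> list:
    """Return a list of problems found in one SNP file line (empty if none).

    Assumed layout (from the SNPsplit documentation, not verified here):
    SNP-ID <tab> Chromosome <tab> Position <tab> Strand <tab> Ref/SNP
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return [f"expected 5 tab-separated fields, found {len(fields)}"]

    _snp_id, _chrom, pos, _strand, alleles = fields
    problems = []
    if not pos.isdigit():
        problems.append(f"position is not a positive integer: {pos!r}")
    parts = alleles.split("/")
    if len(parts) != 2 or not all(p in VALID_BASES for p in parts):
        problems.append(f"alleles are not in Ref/SNP form: {alleles!r}")
    return problems


if __name__ == "__main__":
    # A line as posted above: four space-separated fields.
    print(check_snp_line("rs544419019 1 11012 C/G"))
    # The same SNP in the assumed five-column, tab-separated layout.
    print(check_snp_line("rs544419019\t1\t11012\t1\tC/G"))
```

The lines posted above have four space-separated fields, so under this assumption they fail the check.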

My log file is

Reading/filtering VCF file: No (skipped by user)
Reference genome: ../hg38_genome/
N-masking: Yes
Full SNP genome: No
SNP strain: SPRET_EiJ

Using the following chromosomes (HARCODED IN!!!):
1 2 3 4 5 6 7 8 9 10 11
12 13 14 15 16 17 18 19 20 21 22
X Y MT

Skipped reading the VCF file and filtering SNPs again (specified by user)

Now reading in and storing sequence information of the genome specified in: ../hg38_genome/
chr 1 (248956422 bp)
chr 2 (242193529 bp)
chr 3 (198295559 bp)
chr 4 (190214555 bp)
chr 5 (181538259 bp)
chr 6 (170805979 bp)
chr 7 (159345973 bp)
chr 8 (145138636 bp)
chr 9 (138394717 bp)
chr 10 (133797422 bp)
chr 11 (135086622 bp)
chr 12 (133275309 bp)
chr 13 (114364328 bp)
chr 14 (107043718 bp)
chr 15 (101991189 bp)
chr 16 (90338345 bp)
chr 17 (83257441 bp)
chr 18 (80373285 bp)
chr 19 (58617616 bp)
chr 20 (64444167 bp)
chr X (156040895 bp)
chr Y (57227415 bp)
chr M (16569 bp)

Processing chromosome 1 (for strain SPRET_EiJ)
Reading SNPs from file /net/noble/vol8/hefang2/hg38.Nmasked.common/SNPs_SPRET_EiJ/chr1.txt
Clearing SNP array...
Writing modified chromosome (N-masking)
Writing N-masked output to: /net/noble/vol8/hefang2/hg38.Nmasked.common/SPRET_EiJ_N-masked/chr1.N-masked.fa
0 SNPs total for chromosome 1
0 positions on chromosome 1 were changed to 'N'

Processing chromosome 2 (for strain SPRET_EiJ)
Reading SNPs from file /net/noble/vol8/hefang2/hg38.Nmasked.common/SNPs_SPRET_EiJ/chr2.txt
Clearing SNP array...
Writing modified chromosome (N-masking)
Writing N-masked output to: /net/noble/vol8/hefang2/hg38.Nmasked.common/SPRET_EiJ_N-masked/chr2.N-masked.fa
0 SNPs total for chromosome 2
0 positions on chromosome 2 were changed to 'N'

FelixKrueger commented 2 years ago

Excellent, I am glad it worked now - especially since it didn't require anything from my side :P Good luck!