FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0

prepareGenome #38

Closed nservant closed 3 years ago

nservant commented 4 years ago

Hi, when I run the SNPsplit_genome_preparation script on the complete mouse genome (base chromosomes plus all scaffolds/fixes) with --no_nmasking, the full_sequence output contains only the base chromosomes.

My genome reference comes from:

 ftp://ftp.ensembl.org/pub/release-98/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.toplevel.fa.gz

>>grep ">" Mus_musculus.GRCm38.dna.toplevel.fa 
>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
>2 dna:chromosome chromosome:GRCm38:2:1:182113224:1 REF
>3 dna:chromosome chromosome:GRCm38:3:1:160039680:1 REF
>4 dna:chromosome chromosome:GRCm38:4:1:156508116:1 REF
>5 dna:chromosome chromosome:GRCm38:5:1:151834684:1 REF
>6 dna:chromosome chromosome:GRCm38:6:1:149736546:1 REF
>7 dna:chromosome chromosome:GRCm38:7:1:145441459:1 REF
>8 dna:chromosome chromosome:GRCm38:8:1:129401213:1 REF
>9 dna:chromosome chromosome:GRCm38:9:1:124595110:1 REF
>10 dna:chromosome chromosome:GRCm38:10:1:130694993:1 REF
>11 dna:chromosome chromosome:GRCm38:11:1:122082543:1 REF
>12 dna:chromosome chromosome:GRCm38:12:1:120129022:1 REF
>13 dna:chromosome chromosome:GRCm38:13:1:120421639:1 REF
>14 dna:chromosome chromosome:GRCm38:14:1:124902244:1 REF
>15 dna:chromosome chromosome:GRCm38:15:1:104043685:1 REF
>16 dna:chromosome chromosome:GRCm38:16:1:98207768:1 REF
>17 dna:chromosome chromosome:GRCm38:17:1:94987271:1 REF
>18 dna:chromosome chromosome:GRCm38:18:1:90702639:1 REF
>19 dna:chromosome chromosome:GRCm38:19:1:61431566:1 REF
>X dna:chromosome chromosome:GRCm38:X:1:171031299:1 REF
>Y dna:chromosome chromosome:GRCm38:Y:1:91744698:1 REF
>MT dna:chromosome chromosome:GRCm38:MT:1:16299:1 REF
>CHR_MG171_PATCH dna:chromosome chromosome:GRCm38:CHR_MG171_PATCH:1:151834685:1 PATCH_FIX
>CHR_MG4222_MG3908_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4222_MG3908_PATCH:1:94987243:1 PATCH_FIX
>CHR_MG51_PATCH dna:chromosome chromosome:GRCm38:CHR_MG51_PATCH:1:156507375:1 PATCH_FIX
>CHR_MG3496_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3496_PATCH:1:195440828:1 PATCH_FIX
>CHR_MG4200_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4200_PATCH:1:94983374:1 PATCH_FIX
>CHR_MG4243_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4243_PATCH:1:156484188:1 PATCH_FIX
>CHR_MG4209_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4209_PATCH:1:91793962:1 PATCH_FIX
>CHR_MG74_PATCH dna:chromosome chromosome:GRCm38:CHR_MG74_PATCH:1:104052134:1 PATCH_FIX
>CHR_MG4310_MG4311_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4310_MG4311_PATCH:1:156656003:1 PATCH_FIX
>CHR_MG4249_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4249_PATCH:1:61433356:1 PATCH_FIX
>CHR_MG3833_MG4220_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3833_MG4220_PATCH:1:98208654:1 PATCH_FIX
>CHR_MG3231_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3231_PATCH:1:171029545:1 PATCH_FIX
>CHR_MG4151_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4151_PATCH:1:145439975:1 PATCH_FIX
>CHR_MG104_PATCH dna:chromosome chromosome:GRCm38:CHR_MG104_PATCH:1:170913546:1 PATCH_FIX
>CHR_MMCHR1_CHORI29_IDD5_1 dna:chromosome chromosome:GRCm38:CHR_MMCHR1_CHORI29_IDD5_1:1:195506435:1 PATCH_NOVEL
>CHR_MG3700_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3700_PATCH:1:90658154:1 PATCH_FIX
>CHR_MG3530_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3530_PATCH:1:130695022:1 PATCH_FIX
>CHR_MG4261_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4261_PATCH:1:103906836:1 PATCH_FIX
>CHR_MG3251_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3251_PATCH:1:145419646:1 PATCH_FIX
>CHR_MG3562_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3562_PATCH:1:156470354:1 PATCH_FIX
>CHR_CAST_EI_MMCHR11_CTG4 dna:chromosome chromosome:GRCm38:CHR_CAST_EI_MMCHR11_CTG4:1:122190308:1 PATCH_NOVEL
>CHR_WSB_EIJ_MMCHR11_CTG2 dna:chromosome chromosome:GRCm38:CHR_WSB_EIJ_MMCHR11_CTG2:1:122242168:1 PATCH_NOVEL
>CHR_PWK_PHJ_MMCHR11_CTG2 dna:chromosome chromosome:GRCm38:CHR_PWK_PHJ_MMCHR11_CTG2:1:122246885:1 PATCH_NOVEL
>CHR_MG3648_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3648_PATCH:1:104165524:1 PATCH_FIX
>CHR_MG3618_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3618_PATCH:1:156477076:1 PATCH_FIX
>CHR_CAST_EI_MMCHR11_CTG5 dna:chromosome chromosome:GRCm38:CHR_CAST_EI_MMCHR11_CTG5:1:122035401:1 PATCH_NOVEL
>CHR_PWK_PHJ_MMCHR11_CTG3 dna:chromosome chromosome:GRCm38:CHR_PWK_PHJ_MMCHR11_CTG3:1:122032376:1 PATCH_NOVEL
>CHR_MG4136_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4136_PATCH:1:156508116:1 PATCH_FIX
>CHR_MG4138_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4138_PATCH:1:130620757:1 PATCH_FIX
>CHR_MG3835_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3835_PATCH:1:90835696:1 PATCH_FIX
>CHR_MG89_PATCH dna:chromosome chromosome:GRCm38:CHR_MG89_PATCH:1:159939961:1 PATCH_FIX
>CHR_MG4213_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4213_PATCH:1:91736668:1 PATCH_FIX
>CHR_MG3829_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3829_PATCH:1:122082543:1 PATCH_FIX
>CHR_MG209_PATCH dna:chromosome chromosome:GRCm38:CHR_MG209_PATCH:1:94987270:1 PATCH_FIX
>CHR_WSB_EIJ_MMCHR11_CTG3 dna:chromosome chromosome:GRCm38:CHR_WSB_EIJ_MMCHR11_CTG3:1:122041104:1 PATCH_NOVEL
>CHR_MG4308_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4308_PATCH:1:122082543:1 PATCH_FIX
>CHR_MG3609_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3609_PATCH:1:156568640:1 PATCH_FIX
>CHR_MG4180_PATCH dna:chromosome chromosome:GRCm38:CHR_MG4180_PATCH:1:120129530:1 PATCH_FIX
>CHR_MG3686_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3686_PATCH:1:170897390:1 PATCH_FIX
>CHR_MG65_PATCH dna:chromosome chromosome:GRCm38:CHR_MG65_PATCH:1:61442615:1 PATCH_FIX
>CHR_MG3627_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3627_PATCH:1:124903046:1 PATCH_FIX
>CHR_MG3999_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3999_PATCH:1:195424274:1 PATCH_FIX
>CHR_MG3699_PATCH dna:chromosome chromosome:GRCm38:CHR_MG3699_PATCH:1:90657263:1 PATCH_FIX

Command line:

SNPsplit_genome_preparation --strain CAST_EiJ --reference_genome genome/ --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf --no_nmasking

Output:

>>grep ">" CAST_EiJ_maternal_genome.fa 
>10
>11
>12
>13
>14
>15
>16
>17
>18
>19
>1
>2
>3
>4
>5
>6
>7
>8
>9
>MT
>X
>Y
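To see exactly which sequences are missing from the output, the two header listings can be compared programmatically. The following is a minimal sketch, not part of SNPsplit: the function names `fasta_names` and `missing_sequences` are hypothetical, and the `ACGT` sequence lines are placeholders rather than real data.

```python
def fasta_names(lines):
    """Return sequence names (first word after '>') in order of appearance."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def missing_sequences(reference_lines, output_lines):
    """Names present in the reference but absent from the output, in reference order."""
    produced = set(fasta_names(output_lines))
    return [name for name in fasta_names(reference_lines) if name not in produced]


# Two headers taken from the listing above; "ACGT" is placeholder sequence.
reference = [
    ">1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF",
    "ACGT",
    ">CHR_MG171_PATCH dna:chromosome chromosome:GRCm38:CHR_MG171_PATCH:1:151834685:1 PATCH_FIX",
    "ACGT",
]
output = [">1", "ACGT"]

print(missing_sequences(reference, output))  # ['CHR_MG171_PATCH']
```

Both functions accept any iterable of lines, so open file handles for the reference and the generated genome can be passed in directly.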

I think it would be good to export all chromosomes, even if they have no SNPs. From Ensembl: Fix patches: provide improved sequence for known assembly errors. These patches will be incorporated into the primary assembly in the next major assembly release. They are coloured green in the Chromosome summary page and Region in detail page. They are improvements on the primary assembly and should be used preferentially over the primary assembly.

Thanks @FelixKrueger! Nicolas

FelixKrueger commented 4 years ago

Hi Nicolas,

I have now tried to change the behaviour to print out all chromosomes, even if they were not covered by SNPs. Could you give it a whirl and see if it appears to do what you wanted? Addressed here: 9a81c16576a88a7d0e83b4c19a9d3e6b3d9ed4c9
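The behaviour described here (write every reference sequence, substituting bases only where SNPs exist) can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual SNPsplit code: `build_strain_genome`, its arguments, and the toy sequences are all hypothetical, and the `n_mask` switch only loosely mirrors the difference between N-masking and full-sequence output.

```python
def build_strain_genome(reference, snps_by_chrom, n_mask=False):
    """Return one output sequence per reference sequence.

    reference:      dict mapping sequence name -> sequence string
    snps_by_chrom:  dict mapping sequence name -> {1-based position: alt base}
    n_mask:         write 'N' at SNP positions instead of the alt base

    Sequences without any SNP entry are copied through unchanged,
    so nothing is dropped from the output.
    """
    genome = {}
    for name, seq in reference.items():
        bases = list(seq)
        for pos, alt in snps_by_chrom.get(name, {}).items():
            bases[pos - 1] = "N" if n_mask else alt
        genome[name] = "".join(bases)
    return genome


# Toy data (hypothetical): only chromosome 1 carries a SNP.
toy_reference = {"1": "ACGTAC", "CHR_MG171_PATCH": "GGCCTT"}
toy_snps = {"1": {2: "T"}}

full = build_strain_genome(toy_reference, toy_snps)
masked = build_strain_genome(toy_reference, toy_snps, n_mask=True)
print(full)    # {'1': 'ATGTAC', 'CHR_MG171_PATCH': 'GGCCTT'}
print(masked)  # {'1': 'ANGTAC', 'CHR_MG171_PATCH': 'GGCCTT'}
```

Iterating over the reference rather than over the SNP table is what keeps patch sequences without SNPs in the output.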

nservant commented 4 years ago

Hi @FelixKrueger, I ran the new version.

SNPsplit_genome_preparation --strain CAST_EiJ --reference_genome genome --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf

Two things:

Is this expected? Otherwise, I do have all chromosomes as expected in the results folder. Cheers

FelixKrueger commented 4 years ago

Hmm, the N-masking seems to work fine if you specify --full_sequence as well, I'll take another look tomorrow.

FelixKrueger commented 4 years ago

Right, it was ... - a scoping issue. It should work now; could you clone the dev version and try again? Addressed here: 0e4431e98645058c69a8503ddb0ce324b26b5b00.

nservant commented 4 years ago

Yes. Much better now!

Summary 20668547 Ns were newly introduced into the N-masked genome for strain CAST_EiJ in total
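A figure like this can be cross-checked independently by comparing a reference sequence with its N-masked counterpart. The sketch below assumes that "newly introduced" means positions that are N in the masked sequence but not in the reference; that reading, the function name `count_new_ns`, and the toy sequences are assumptions, not taken from SNPsplit.

```python
def count_new_ns(reference_seq, masked_seq):
    """Count positions that are N in masked_seq but not in reference_seq."""
    if len(reference_seq) != len(masked_seq):
        raise ValueError("sequences must have the same length")
    return sum(
        1
        for ref_base, new_base in zip(reference_seq, masked_seq)
        if new_base.upper() == "N" and ref_base.upper() != "N"
    )


# Toy example: the reference already has one N, masking adds two more.
print(count_new_ns("ACNGT", "ANNGN"))  # 2
```

Summing this count over all sequence pairs would give a genome-wide total to compare against the reported summary.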

FelixKrueger commented 4 years ago

Awesome, I'll leave this open for a few more days to give you some time to test. It will then find its way into the next release.