Error when trying to run 01_make_probeseq.py

I have successfully installed the requirements to run this, so that I can try and remap the dbsnp loci. However when I run this command:

$ python 
/home/mccuem/shared/Projects/HorseGenomeProject/scripts/EquCab3/MNEc2to3/scripts/01_make_probeseq.py --vcf GCA_000002305.1.chrs.sorted.vcf --fasta /home/mccuem/shared/FASTAs/Equus_cab_nucl_wChrUn1_2.fasta --out dbsnp_probes.fa

I get this error:

Traceback (most recent call last):
  File "/home/mccuem/shared/Projects/HorseGenomeProject/scripts/EquCab3/MNEc2to3/scripts/01_make_probeseq.py", line 48, in <module>
    print_flank_sequence(args.vcf,args.fasta,args.out)
  File "/home/mccuem/shared/Projects/HorseGenomeProject/scripts/EquCab3/MNEc2to3/scripts/01_make_probeseq.py", line 20, in print_flank_sequence
    if fasta[var.chrom][var.pos] != var.ref:
  File "/home/mccuem/durwa004/.conda/envs/lp_sda/lib/python3.6/site-packages/locuspocus/fasta.py", line 223, in __getitem__
    raise ValueError(f'{chrom_name} not in {self.m80.name}')
ValueError: chr1 not in temp

I'm not sure where I'm going wrong, I have renamed the chromosomes in the dbsnp file so that they match the fasta and even tried sorting the vcf to be in the same order as the fasta and am still getting the same error. Not sure where I am going wrong!

There were some pieces of code that were commented out on accident. I reverted that changes to what they should be. Can you try pulling the changes from GitHub or re-downloading the scripts and trying again?

It seems to be working now! Thank you

On Thu, Jan 16, 2020 at 4:37 PM Rob Schaefer notifications@github.com wrote:

There were some pieces of code that were commented out on accident. I reverted that changes to what they should be. Can you try pulling the changes from GitHub or re-downloading the scripts and trying again?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UMN-EGGL/MNEc2to3/issues/4?email_source=notifications&email_token=ADY2HS2AQ7BOOYSQIUCWR5LQ6DOS5A5CNFSM4KHNLJ52YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJFZQ5Q#issuecomment-575379574, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADY2HS7KTU2JBPKZFUYAUJ3Q6DOS5ANCNFSM4KHNLJ5Q .

-- Sian Durward-Akhurst, BVMS, MS, Diplomate ACVIM, MRCVS PhD Candidate in Equine Genetics

You know where to go once it breaks :D

So true!!

Sadly I think it has. I tried running it on my laptop to avoid the MSI limitations and it must have run for ~1.5 hours and then I get this error:


Traceback (most recent call last):

  File "../MNEc2to3/scripts/01_make_probeseq.py", line 48, in <module>

    print_flank_sequence(args.vcf,args.fasta,args.out)

  File "../MNEc2to3/scripts/01_make_probeseq.py", line 20, in
print_flank_sequence

    if fasta[var.chrom][var.pos] != var.ref:

  File
"/Users/durwa004/Library/Python/3.7/lib/python/site-packages/locuspocus/fasta.py",
line 223, in __getitem__

    raise ValueError(f'{chrom_name} not in {self.m80.name}')

ValueError: chr1 not in temp```

My computer was struggling to run it and so I reinstalled the environment I
created on msi and everything seemed to install successfully but I get this
error:

/home/mccuem/durwa004/.conda/envs/lp_sda/bin/python3 /home/mccuem/shared/Projects/HorseGenomeProject/scripts/EquCab3/MNEc2to3/scripts/01_make_probeseq.py --vcf GCA_000002305.1.chrs.vcf.gz --fasta GCA_000002305.1_EquCab2.0_genomic.fna.gz --out dbsnp_probes.fa

Traceback (most recent call last):

File "/home/mccuem/shared/Projects/HorseGenomeProject/scripts/EquCab3/MNEc2to3/scripts/01_make_probeseq.py", line 48, in

print_flank_sequence(args.vcf,args.fasta,args.out)

File "/home/mccuem/shared/Projects/HorseGenomeProject/scripts/EquCab3/MNEc2to3/scripts/01_make_probeseq.py", line 20, in print_flank_sequence

if fasta[var.chrom][var.pos] != var.ref:

File "/home/mccuem/durwa004/.local/lib/python3.7/site-packages/locuspocus/fasta.py", line 223, in getitem

raise ValueError(f'{chrom_name} not in {self.m80.name}')

ValueError: CM000377.2 not in temp



This is with the right version of the EquCab2 genome (downloaded from NCBI)
and the chromosome names are the same between the vcf and fasta as far as I
can tell.

Sorry to be a pain.

On Fri, Jan 17, 2020 at 7:54 AM Rob Schaefer <notifications@github.com>
wrote:

> You know where to go once it breaks :D
>
> —
> You are receiving this because you authored the thread.
> Reply to this email directly, view it on GitHub
> <https://github.com/UMN-EGGL/MNEc2to3/issues/4?email_source=notifications&email_token=ADY2HS3UIF3TB7JEMETUI23Q6G2B7A5CNFSM4KHNLJ52YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJHX34Q#issuecomment-575634930>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ADY2HS5U4MT5KB6QU6QYF2LQ6G2B7ANCNFSM4KHNLJ5Q>
> .
>

-- 
Sian Durward-Akhurst, BVMS, MS, Diplomate ACVIM, MRCVS
PhD Candidate in Equine Genetics

University of Minnesota Equine Genetics and Genomics Laboratory
225 Veterinary Population Medicine
1365 Gortner Avenue
Saint Paul, MN
55108

durwa004@umn.edu

Pronouns: she/her/hers
http://d.umn.edu/sexuality-gender-equity-initiatives/education-advocacy/pronouns

There appears to be a SNP in your VCF assigned to chromosome CM000377.2 which isn't in the reference genome. Here are all the chromosome names I could find on MSI:

[/home/mccuem/shared/FASTAs] % cat Equus_cab_nucl_wChrUn1_2.fasta | grep ">"
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr19
>chr1
>chr20
>chr21
>chr22
>chr23
>chr24
>chr25
>chr26
>chr27
>chr28
>chr29
>chr2
>chr30
>chr31
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chrX
>chrUn
>chrUn2

So either we need to add those chromosomes to the fasta or remove them from the VCF file. Otherwise, we can add a chunk of code to catch those errors and skip those SNPs (which I think would be easiest). Thoughts?

That makes sense, but I am not sure that is it. I pulled the same reference as the one that the dbsnp SNPs are from (GCA_000002305.1) and made sure the chromosome names matched

This is the top of the dbsnp vcf file:


##fileformat=VCFv4.2

##FILTER=<ID=PASS,Description="All filters passed">

##INFO=<ID=ALMM,Number=0,Type=Flag,Description="Alleles mismatch flag,
present when some of the submitted variants have inconsistent allele
information. See https://github.com/EBI

##INFO=<ID=ASMM,Number=0,Type=Flag,Description="Assembly mismatch flag,
present when the reference allele doesn't match the reference sequence">

##INFO=<ID=LOE,Number=0,Type=Flag,Description="Lack of evidence flag,
present if no submitted variant includes genotype or frequency information">

##INFO=<ID=RS_VALIDATED,Number=0,Type=Flag,Description="RS validated flag,
present when the RS was validated by any method as indicated by the dbSNP
validation status">

##INFO=<ID=SID,Number=.,Type=String,Description="Identifiers of studies
that report a variant">

##INFO=<ID=SS_VALIDATED,Number=1,Type=Integer,Description="Number of
submitted variants clustered in an RS that were validated by any method as
indicated by the dbSNP validation s

##INFO=<ID=VC,Number=1,Type=String,Description="Variant class according to
the Sequence Ontology">

##contig=<ID=CM000386.2,accession="CM000386.2">

##contig=<ID=CM000387.2,accession="CM000387.2">

##contig=<ID=CM000388.2,accession="CM000388.2">

##contig=<ID=CM000389.2,accession="CM000389.2">

##contig=<ID=CM000390.2,accession="CM000390.2">

##contig=<ID=CM000391.2,accession="CM000391.2">

##contig=<ID=CM000392.2,accession="CM000392.2">

##contig=<ID=CM000393.2,accession="CM000393.2">

##contig=<ID=CM000394.2,accession="CM000394.2">

##contig=<ID=CM000395.2,accession="CM000395.2">

##contig=<ID=CM000377.2,accession="CM000377.2">

##contig=<ID=CM000396.2,accession="CM000396.2">

##contig=<ID=CM000397.2,accession="CM000397.2">

##contig=<ID=CM000398.2,accession="CM000398.2">

##contig=<ID=CM000399.2,accession="CM000399.2">

##contig=<ID=CM000400.2,accession="CM000400.2">

##contig=<ID=CM000401.2,accession="CM000401.2">

##contig=<ID=CM000402.2,accession="CM000402.2">

##contig=<ID=CM000403.2,accession="CM000403.2">

##contig=<ID=CM000404.2,accession="CM000404.2">

##contig=<ID=CM000405.2,accession="CM000405.2">

##contig=<ID=CM000378.2,accession="CM000378.2">

##contig=<ID=CM000406.2,accession="CM000406.2">

##contig=<ID=CM000407.2,accession="CM000407.2">

##contig=<ID=CM000379.2,accession="CM000379.2">

##contig=<ID=CM000380.2,accession="CM000380.2">

##contig=<ID=CM000381.2,accession="CM000381.2">

##contig=<ID=CM000382.2,accession="CM000382.2">

##contig=<ID=CM000383.2,accession="CM000383.2">

##contig=<ID=CM000384.2,accession="CM000384.2">

##contig=<ID=CM000385.2,accession="CM000385.2">

##contig=<ID=NC_001640.1,accession="X79547.1">

##contig=<ID=CM000408.2,accession="CM000408.2">

##reference=<ID=EquCab2.0,accession=GCA_000002305.1,URL=
https://www.ebi.ac.uk/ena/data/view/GCA_000002305.1>

##bcftools_annotateVersion=1.2+htslib-1.2.1

##bcftools_annotateCommand=annotate --rename-chrs
rename_dbsnp_chrs_with_chr.txt GCA_000002305.1_current_ids.vcf.gz

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO

CM000377.2      425     rs396553607     A       G       .       .
LOE;SID=DINDOTLAB_QH_SNPS;SS_VALIDATED=0;VC=SO:0001483

CM000377.2      551     rs394481920     T       G       .       .
LOE;SID=EGGL_MNEC_23M,DINDOTLAB_QH_SNPS;SS_VALIDATED=0;VC=SO:0001483

CM000377.2      561     rs395435164     G       C       .       .
LOE;SID=EGGL_MNEC_23M,DINDOTLAB_QH_SNPS;SS_VALIDATED=0;VC=SO:0001483

These are the fasta file chromosome names


cat GCA_000002305.1_EquCab2.0_genomic_renamed.fna | grep ">C"

>CM000377.2

>CM000378.2

>CM000379.2

>CM000380.2

>CM000381.2

>CM000382.2

>CM000383.2

>CM000384.2

>CM000385.2

>CM000386.2

>CM000387.2

>CM000388.2

>CM000389.2

>CM000390.2

>CM000391.2

>CM000392.2

>CM000393.2

>CM000394.2

>CM000395.2

>CM000396.2

>CM000397.2

>CM000398.2

>CM000399.2

>CM000400.2

>CM000401.2

>CM000402.2

>CM000403.2

>CM000404.2

>CM000405.2

>CM000406.2

>CM000407.2

>CM000408.2

On Fri, Jan 17, 2020 at 12:15 PM Rob Schaefer notifications@github.com wrote:

There appears to be a SNP in your VCF assigned to chromosome CM000377.2 which isn't in the reference genome. Here are all the chromosome names I could find on MSI:

[/home/mccuem/shared/FASTAs] % cat Equus_cab_nucl_wChrUn1_2.fasta | grep ">"

chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr1 chr20 chr21 chr22 chr23 chr24 chr25 chr26 chr27 chr28 chr29 chr2 chr30 chr31 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrX chrUn chrUn2

So either we need to add those chromosomes to the fasta or remove them from the VCF file. Otherwise, we can add a chunk of code to catch those errors and skip those SNPs (which I think would be easiest). Thoughts?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/UMN-EGGL/MNEc2to3/issues/4?email_source=notifications&email_token=ADY2HS7X2C2OSYAXITT6NGDQ6HYVRA5CNFSM4KHNLJ52YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJIQ2IQ#issuecomment-575737122, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADY2HSZAJVY46FZVEYYPMILQ6HYVRANCNFSM4KHNLJ5Q .

-- Sian Durward-Akhurst, BVMS, MS, Diplomate ACVIM, MRCVS PhD Candidate in Equine Genetics

University of Minnesota Equine Genetics and Genomics Laboratory 225 Veterinary Population Medicine 1365 Gortner Avenue Saint Paul, MN 55108

durwa004@umn.edu

Pronouns: she/her/hers http://d.umn.edu/sexuality-gender-equity-initiatives/education-advocacy/pronouns

UMN-EGGL / MNEc2to3

Error when trying to run 01_make_probeseq.py #4