error in returning ID catergories with hg19 version

jh2663 commented 4 years ago

Hi, just another issue with handling indel categories.

I have a FASTA file locally (out_file) and run this command to get ID categories:

palimpdir<-'/Users/hongj/Library/R/3.6/library/Palimpsest/' mut_data_vcf<-annotate_VCF(vcf = mut_data_selected.ordered, genome_build = "hg19", ref_genome = ref_genome, ref_fasta = out_file, palimpdir= palimpdir, add_ID_cats=TRUE, add_DBS_cats=FALSE)

SBS annotation works perfectly but not for ID. It shows following errors:

================================================================== [1] Runnng PCAWG7-data-preparation-version-1.5 in python to extract Indel categories.. ('failed while annotating', '/Users/hongj/Library/R/3.6/library/Palimpsest/Temporary//python_vcf_indel.simple') Traceback (most recent call last): File "/Users/hongj/Library/R/3.6/library/Palimpsest/exec/make_spectra_indels.py", line 123, in file_features_plus, file_counts_plus = lookups_indels.get_indel_class_counts(filename, input_type, '+', variant_filters=filters) File "/Users/hongj/Library/R/3.6/library/Palimpsest/exec/lookups_indels.py", line 757, in get_indel_class_counts annotate_simple_indel(input_filename, annotated_cache_file) File "/Users/hongj/Library/R/3.6/library/Palimpsest/exec/lookups_indels.py", line 328, in annotate_simple_indel LHS_seq = get_reference_seq(ref_genome, test_chromosome, start_pos-5*del_length, start_pos-1) File "/Users/hongj/Library/R/3.6/library/Palimpsest/exec/lookups_indels.py", line 85, in get_reference_seq sequence = _grch37_reference.get_sequence(chromosome, start, end) AttributeError: 'NoneType' object has no attribute 'get_sequence' Error in read.table(file = file, header = header, sep = sep, quote = quote, : no lines available in input In addition: Warning message: In add_ID_cats_ToVCF(vcf = vcf, ref_fasta = ref_fasta, palimpdir_man = palimpdir, :

Error in read.table(file = file, header = header, sep = sep, quote = quote, : no lines available in input

================================================================== Any idea on this?

best,

jh2663 commented 4 years ago

UPDATE: When I give a try with hg38 version, I find it seems working with following Warning message only:

Warning message: In add_ID_cats_ToVCF(vcf = vcf, ref_fasta = ref_fasta, palimpdir_man = palimpdir, : Indel category extraction with PCAWG7-data-preparation-version-1.5 python script is finished (if there are error messages above it has not been successful)

So it should be 'hg19' version issue?

thanks.

jh2663 commented 4 years ago

UPDATE2:

OK. I've figured out what happened. Actually my data was based on 'hs37d5' genome version not precisely 'hg19'. So I downloaded 'BSgenome.Hsapiens.1000genomes.hs37d5' and gave a try this:

================================================================= mut_data_vcf<-annotate_VCF(vcf = mut_data_selected.ordered, genome_build = "hg19", ref_genome = ref_genome, ref_fasta = ref_fasta, palimpdir= palimpdir, add_strand_and_SBS_cats = FALSE, add_ID_cats=TRUE, add_DBS_cats=FALSE)

[1] Runnng PCAWG7-data-preparation-version-1.5 in python to extract Indel categories.. Warning message: In add_ID_cats_ToVCF(vcf = vcf, ref_fasta = ref_fasta, palimpdir_man = palimpdir, : Indel category extraction with PCAWG7-data-preparation-version-1.5 python script is finished (if there are error messages above it has not been successful)

================================================================= Now it seems worked. Actually, 'hg19' and 'hs38d5' differ in mitochondrial DNA coordinates but the option 'genome_build' allows only either 'hg19' or 'hg38'. This is a bit confusing.

Please let me know if anything else should be considered.

best,

jh2663 commented 4 years ago

UPDATE3:

sorry for long posting. I got another error using with hs37d5 version. Now SBS annotation is not working possibly because hg19 version has 'chr' tag but not in hs37d5.

================================================================= [1] Adding gene, strand and SBS category annotations .. Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width, : sequence chr1 not found In addition: Warning message: In .Seqinfo.mergexy(x, y) :

Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width, : sequence chr1 not found

===================================================================

Finally, I was able to resolve all of above issues simply by using 'BSgenome.Hsapiens.UCSC.hg19' as 'ref_genome' but 'hs37d5.fasta' as 'ref_fasta'. Could you fix this genome version related issue in the next update?

best,

FunGeST commented 4 years ago

Hi,

Thank you very much for pointing out this issue, we'll look into it and fix it for the next update!

best wishes. Benedict

FunGeST / Palimpsest

error in returning ID catergories with hg19 version #33