Closed eltonjrv closed 7 months ago
Hi @eltonjrv,
Thanks for your interest! Could you please share your MAF file, or the beginning of it, in order to test this on our end? Also, please make sure that only the MAF file is in your ./input/ folder.
Best,
Marcos
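A quick way to act on the advice above is to list everything in the input folder that is not a MAF file. This is a minimal sketch, not part of SigProfilerAssignment: the helper name `non_maf_files` is hypothetical, and it assumes MAF files carry a `.maf` extension.

```python
import tempfile
from pathlib import Path


def non_maf_files(input_dir):
    """Return the names of entries in input_dir that are not *.maf files."""
    return sorted(
        p.name
        for p in Path(input_dir).iterdir()
        # Anything that is a directory or lacks a .maf suffix is reported.
        if not (p.is_file() and p.suffix.lower() == ".maf")
    )


if __name__ == "__main__":
    # Demo on a throwaway folder; point this at your own ./input/ instead.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "samples.maf").write_text("")
        (Path(tmp) / "notes.txt").write_text("")
        print(non_maf_files(tmp))
```

An empty list means the folder holds only `.maf` files.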
Many thanks for your prompt reply, Marcos. Yes, the problem was caused by having files other than the MAF files in the ./input dir.
It now runs partially (several output files were generated); however, almost half of my mutations were skipped and I got the following error message:
> cosmic_fit(samples="./run01/", output="./output01/", input_type='vcf', context_type="96", collapse_to_SBS96=TRUE, cosmic_version=3.3, exome=FALSE, genome_build="GRCh38")
Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 16.35 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 61.06 seconds.
Matrices generated for 13 samples with 467216 errors. Total of 173331 SNVs, 3965 DINUCs, and 174450 INDELs were successfully analyzed.
Error in py_call_impl(callable, call_args$unnamed, call_args$named) :
ValueError: Error: More than 30% of mutations were skipped. Please check the log file for more information.
Run reticulate::py_last_error() for details.
Two questions:
1) Did this error prevent the tool from completing (so that not all outputs were produced), or is it safe to rely on the outputs for the mutations that were successfully analyzed?
2) I don't currently have strand information for my mutations (I will need to rework my MAF files to add it), so I filled the Strand column with a dot for every annotated mutation. Could this be the main reason half of my mutations were skipped?
Here are my 2 MAFs:
==> CL-noRSIDs-noGnomAD_opt2-input4SigProfAssign.maf <==
Hugo Entrez Center Genome Chrom Start End Strand Classification Type Ref Alt1 Alt2 dbSNP SNP_Val_status Tumor_sample
WASH7P NR_024540 . GRCh38 chr1 19432 19435 . RNA DEL ATGG . A . . TTC466
WASH7P NR_024540 . GRCh38 chr1 20094 20096 . RNA DEL TAA . T . . SKES_1
WASH7P NR_024540 . GRCh38 chr1 20094 20096 . RNA DEL TAA . T . . SKNMC

==> PD-noRSIDs-noGnomAD_opt2-input4SigProfAssign.maf <==
Hugo Entrez Center Genome Chrom Start End Strand Classification Type Ref Alt1 Alt2 dbSNP SNP_Val_status Tumor_sample
WASH7P NR_024540 . GRCh38 chr1 19190 19191 . RNA DEL GC . G . . ES_2915
WASH7P NR_024540 . GRCh38 chr1 19432 19435 . RNA DEL ATGG . A . . ES_5366_02
WASH7P NR_024540 . GRCh38 chr1 20094 20096 . RNA DEL TAA . T . . ES_14465_02
Looking forward to hearing back from you, Thanks, Elton
Hi @eltonjrv,
The error is thrown due to an abnormally large number of mutations that are not found in the reference genome. Often in these cases, the wrong reference genome is being used. Could you please confirm that your data is in fact GRCh38?
Additionally, you can reference an example maf on our wiki documentation.
Thanks!
Hi Marcos,
Here is the issue: these data come from a third-party sequencing service, Novogene, and were generated in 2022. I've double-checked their methods, which state only that they used BWA against the hg38 reference genome, the GATK HaplotypeCaller and VariantFiltration modules, and ANNOVAR for annotation. The ANNOVAR output files are labeled with "hg38" in their names, and I converted them to MAF with the 'annovarToMaf' function from maftools; I then wrote ad-hoc code to simplify my MAFs so that they look like the one in your 'example maf' recommendation. Could the fact that the analysis was performed two years ago, perhaps with an outdated GRCh38 reference genome, be the main issue?
Thanks again, Elton
Hi @eltonjrv,
Could you please send 10 of each:
Thanks!
Thanks for your reply. Please find the files attached. 10IDs-toDevel.maf.txt 10SNPs-toDevel.maf.txt
One thing I noticed in my logs/SigProfilerMatrixGenerator_Input_vcffiles_GRCh382024-03-10.out file is that, for the skipped mutations, it reports a position that is off by one (1 less than the position in my actual input MAF file).
See the first 3 skipped-mutation lines from the log file:
#####
The reference base does not match the reference chromosome position. Skipping this mutation: 11 589133 C CA
The reference base does not match the reference chromosome position. Skipping this mutation: 11 650372 CAA C
The reference base does not match the reference chromosome position. Skipping this mutation: 11 669926 CAAAAAA C
#####
Whereas in my input MAF the start position is correct:
#####
$ grep -P 'chr11\t589134\t' CL-noRSIDs-noGnomAD_opt2-input4SigProfAssign.maf
PHRF1 NM_001286581,NM_001286582,NM_001286583,NM_020901 . GRCh38 chr11 589134 589134 . Intron INS C . CA . . A673
PHRF1 NM_001286581,NM_001286582,NM_001286583,NM_020901 . GRCh38 chr11 589134 589134 . Intron INS C . CA . . SKES_1

$ grep -P 'chr11\t650373\t' CL-noRSIDs-noGnomAD_opt2-input4SigProfAssign.maf
DEAF1 NM_001293634,NM_021008 . GRCh38 chr11 650373 650375 . Intron DEL CAA . C . . A673

$ grep -P 'chr11\t669927\t' CL-noRSIDs-noGnomAD_opt2-input4SigProfAssign.maf
DEAF1 NM_001293634,NM_021008 . GRCh38 chr11 669927 669933 . Intron DEL CAAAAAA . C . . A673
DEAF1 NM_001293634,NM_021008 . GRCh38 chr11 669927 669933 . Intron DEL CAAAAAA . C . . SKES_1
DEAF1 NM_001293634,NM_021008 . GRCh38 chr11 669927 669933 . Intron DEL CAAAAAA . C . . TTC466
#####
I've looked up over 20 of my MAF genomic coordinates in UCSC (GRCh38) and can confirm that the reference bases match perfectly in all cases.
I hope we can get a solution for this.
Thanks again, Elton
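The effect described in this post can be illustrated with a toy example: a reference allele that matches at its recorded start fails the check when the lookup is shifted back by one base. This is only a sketch of the symptom, not the tool's actual code; the toy sequence, the position, and the helper name `ref_matches` are all hypothetical.

```python
def ref_matches(reference, pos, ref_allele):
    """True if ref_allele equals the reference starting at 1-based position pos."""
    start = pos - 1  # convert the 1-based coordinate to a 0-based string index
    return reference[start:start + len(ref_allele)] == ref_allele


# Hypothetical toy chromosome; this is not real genome sequence.
toy_reference = "GGTCAAGT"

# Hypothetical deletion written with its leading anchor base: CAA -> C at position 4.
start, ref_allele = 4, "CAA"

print(ref_matches(toy_reference, start, ref_allele))      # lookup at Start
print(ref_matches(toy_reference, start - 1, ref_allele))  # lookup at Start - 1
```

The first lookup matches and the second does not, which mirrors a record that is correct in the input but reported as a reference mismatch one position earlier.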
Hi @eltonjrv,
It looks like the input you are providing is RNA data, and I suspect that this is causing the issue. We use the reference genome to identify the context, so RNA data does not work. Please let me know if this resolves your issue.
Hello there,
No, this is indeed WGS (DNA-seq) data, not RNA-seq data. The "Classification" column says 'RNA' only because the variant falls within a non-coding RNA gene. Was this the reason you thought this is RNA data? If not, could you please clarify why you think this is RNA data?
Thanks, Elton
Hi back again,
Just to let you know that, after converting my MAF files to VCF, the tool ran without any errors.
################
> cosmic_fit(samples="./run03/", output="./run03/output/", input_type='vcf', context_type="96", collapse_to_SBS96=TRUE, cosmic_version=3.3, exome=FALSE, genome_build="GRCh38")
Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 13.93 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 187.06 seconds.
Matrices generated for 13 samples with 0 errors. Total of 173331 SNVs, 3965 DINUCs, and 641666 INDELs were successfully analyzed.
Assigning COSMIC sigs or Signature Database ......
|████████████████████████████████████████| 13/13 [100%] in 24.6s (0.54/s)
Your Job Is Successfully Completed! Thank You For Using SigProfilerAssignment.
################
Thanks anyways for all your attention and support on this open issue, Best, Elton
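The post does not show how the MAF files were converted to VCF. One possible sketch follows, under stated assumptions: the input uses the simplified 16-column layout shown earlier in this thread, Start is the position of the first base of Ref, and Alt2 holds the alternate allele (as the sample rows suggest). The function name `maf_row_to_vcf_line` is hypothetical, and writing the lines out into files is left out.

```python
MAF_COLUMNS = [
    "Hugo", "Entrez", "Center", "Genome", "Chrom", "Start", "End", "Strand",
    "Classification", "Type", "Ref", "Alt1", "Alt2", "dbSNP",
    "SNP_Val_status", "Tumor_sample",
]


def maf_row_to_vcf_line(line):
    """Convert one tab-separated simplified MAF row to (sample, VCF data line)."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(MAF_COLUMNS):
        raise ValueError(f"expected {len(MAF_COLUMNS)} columns, got {len(values)}")
    row = dict(zip(MAF_COLUMNS, values))
    # Minimal VCF data columns: CHROM POS ID REF ALT QUAL FILTER INFO.
    # Assumption: Start already points at the first base of Ref.
    vcf_fields = [row["Chrom"], row["Start"], ".", row["Ref"], row["Alt2"], ".", ".", "."]
    return row["Tumor_sample"], "\t".join(vcf_fields)


# One of the rows quoted earlier in the thread, rebuilt with tab separators.
example = "\t".join([
    "DEAF1", "NM_001293634,NM_021008", ".", "GRCh38", "chr11", "650373",
    "650375", ".", "Intron", "DEL", "CAA", ".", "C", ".", ".", "A673",
])
sample, vcf_line = maf_row_to_vcf_line(example)
print(sample, vcf_line, sep="\n")
```

Returning the sample name alongside each line lets the caller group records per sample before writing them out.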
Great, glad that you were able to resolve the issue with the indexing. Please let us know if you encounter any other issues.
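Comparing the two run summaries in this thread supports the indexing explanation: the SNV and DINUC counts are identical, and the INDEL count grows by exactly the number of errors reported in the first run. The arithmetic below is a sketch that assumes each reported "error" corresponds to one skipped mutation; under that assumption, about 57% of mutations were skipped in the first run.

```python
# Counts copied from the two run summaries in this thread.
first_run = {"errors": 467216, "snvs": 173331, "dinucs": 3965, "indels": 174450}
second_run = {"errors": 0, "snvs": 173331, "dinucs": 3965, "indels": 641666}

# INDELs that were analyzed only after switching to VCF input.
extra_indels = second_run["indels"] - first_run["indels"]

analyzed_first = first_run["snvs"] + first_run["dinucs"] + first_run["indels"]
# Assumption: each "error" in the first run is one skipped mutation.
skipped_fraction = first_run["errors"] / (first_run["errors"] + analyzed_first)

print(extra_indels)
print(extra_indels == first_run["errors"])
print(f"{skipped_fraction:.1%}")
```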
Dear Marcos (or any other SigProfilerAssignmentR developer),
I'm coming across the following error message even after adjusting my MAF file according to the example you pointed me to at https://osf.io/dkjwr .
> cosmic_fit(samples="./input/", output="./output/", input_type='vcf', context_type="96", collapse_to_SBS96=TRUE, cosmic_version=3.3, exome=FALSE, genome_build="GRCh38")
File format not supported
Error in py_call_impl(callable, call_args$unnamed, call_args$named) :
UnboundLocalError: local variable 'samples' referenced before assignment
Run reticulate::py_last_error() for details.
Any clue on what might be happening and how to fix it?
Thanks, Elton