Extract Error: "Sequence contains non-DNA character '*' at position 393"

jon4thin commented 3 months ago

I am trying to run g2gtools extract with a g2gtools DB created from a GTF and a FASTA. It is running successfully but after 14 minutes it stops:

2024-08-05T19:38:02.180415541Z GTTCCAATACTGTGTTGCAGTGGGAGCCCAAACTTTCCCCAGTGTGAGTGCTCCCAGCAAGAAAGTGGCAAAGCAGATGGCCGCAGAGGAAGCCATGAAGGCCCTGCATGGGGAGGCGACCAACTCCATGGCTTCTGATAACCAG 2024-08-05T19:38:02.180441323Z [g2gtools debug] Exon ID=ENSE00003837056.1_L;Exon: ENSE00003837056.1_L chr1_L:154596815-154597005 (-1) #6 2024-08-05T19:38:02.180595316Z [g2gtools debug] chr1_L:154596815-154597005 (Length: 191) 2024-08-05T19:38:02.180600730Z CCTGAAGGTATGATCTCAGAGTCACTTGATAACTTGGAATCCATGATGCCCAACAAGGTCAGGAAGATTGGCGAGCTCGTGAGATACCTGAACACCAACCCTGTGGGTGGCCTTTTGGAGTACGCCCGCTCCCATGGCTTTGCTGCTGAATTCAAGTTGGTCGACCAGTCCGGACCTCCTCACGAGCCCAA 2024-08-05T19:38:02.180627776Z [g2gtools debug] Exon ID=ENSE00003836678.2_L;Exon: ENSE00003836678.2_L chr1_L:154589767-154590419 (-1) #7 2024-08-05T19:38:02.183394207Z Traceback (most recent call last): 2024-08-05T19:38:02.185504398Z File "/opt/conda/envs/g2gtools/bin/g2gtools", line 4, in <module> 2024-08-05T19:38:02.186001359Z __import__('pkg_resources').run_script('g2gtools==0.2.9', 'g2gtools') 2024-08-05T19:38:02.186006933Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script 2024-08-05T19:38:02.187122669Z self.require(requires)[0].run_script(script_name, ns) 2024-08-05T19:38:02.187127761Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script 2024-08-05T19:38:02.187499526Z exec(code, namespace, namespace) 2024-08-05T19:38:02.187942044Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/EGG-INFO/scripts/g2gtools", line 132, in <module> 2024-08-05T19:38:02.187946940Z G2GToolsApp() 2024-08-05T19:38:02.187949734Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/EGG-INFO/scripts/g2gtools", line 99, in __init__ 2024-08-05T19:38:02.187994284Z getattr(self, args.command)() 2024-08-05T19:38:02.187999699Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/EGG-INFO/scripts/g2gtools", line 105, in extract 2024-08-05T19:38:02.188002813Z g2gtools.g2g_commands.command_fasta_extract(sys.argv[2:], self.script_name + ' extract') 2024-08-05T19:38:02.188005779Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/g2gtools/g2g_commands.py", line 422, in command_fasta_extract 2024-08-05T19:38:02.189323301Z fasta.fasta_extract_transcripts(fasta_file, args.database, None, raw=args.raw) 2024-08-05T19:38:02.189330035Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/g2gtools/fasta.py", line 562, in fasta_extract_transcripts 2024-08-05T19:38:02.190316627Z partial_seq_str = str(g2g_utils.reverse_complement_sequence(partial_seq)) 2024-08-05T19:38:02.190322077Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/g2gtools/g2g_utils.py", line 229, in reverse_complement_sequence 2024-08-05T19:38:02.190975659Z return reverse_sequence(complement_sequence(sequence)) 2024-08-05T19:38:02.190981606Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/g2gtools/g2g_utils.py", line 214, in complement_sequence 2024-08-05T19:38:02.191010963Z raise ValueError("Sequence contains non-DNA character '{0}' at position {1:n}\n".format(val[position], position + 1)) 2024-08-05T19:38:02.191575326Z ValueError: Sequence contains non-DNA character '*' at position 393

Is '*' supposed to denote a stop codon? but then again, I only have 35 in my entire FASTA....

mattjvincent commented 3 months ago

You seem to be using an older version. Python 2.7 and g2gtools 0.2.9. Can you update to Python 3 and g2gtools 2.0.0 and retry?

jon4thin commented 3 months ago

@mattjvincent thank you so much for your reply. I have been using the docker image: kbchoi/emase:latest , I see that this has not been updated.....

Ideally I would like to use a docker image, since I am running this on a velsera-based cloud platform. That is not avilable, correct?
I cannot even find g2g tools 2.0.0. The most recent I could find is 1.0.1 on the github: https://github.com/churchill-lab/g2gtools

mattjvincent commented 3 months ago

Installing from the main branch will install 2.0.0. I just created a release. Feel free to create a Docker image of your own.

There is one in the main branch in the top level directory.

mattjvincent commented 3 months ago

@jon4thin ...

Try:

docker run churchilllab/g2gtools:2.0.0 g2gtools --help

You will have to mount directories and such, but this should get you started.

jon4thin commented 3 months ago

I will give it a shot right now and update shortly! I have never made a Docker image before so I will have to see how far i can get. tysvm for getting me going!

jon4thin commented 3 months ago

Is the /opt/conda/envs/emase/bin/emase-zero command in kbchoi/emase:latest also outdated? I noticed that the flags in the documentation online and what I can do with the tool from the docker image is slightly different.

jon4thin commented 3 months ago

sorry for the delaychurchilllab/g2gtools:2.0.0 works and it works on the Velsera CWL-based cloud compute platform as well! Let me know if there is any updated documentation (i.e. change in the command flags) and let me know if kbchoi/emase:latest also outdated! Thanks so much for your support <3~

mattjvincent commented 3 months ago

EMASE's codebase was folded into GBRS. However, emase-zero has remain unchanged. There have been some minor flag changes since upgrading gbrs to Python 3, but we will have to compile a list.

jon4thin commented 3 months ago

Hey @mattjvincent , sorry to bother again. I updated my pipeline to use churchilllab/g2gtools:2.0.0 and while I received no issues with the * in the patch or transform or even gtf2db steps, g2g extract returned an error:

ValueError: Sequence contains non-DNA character '*' at position 393

I have the full error if needed.

Unless I am missing something, I had a conversation with a GATK person, and it seems like one can just grep/awk remove these records anyway:

Remove the * ALT Allele Reporting From a g.vcf Before (or After) Genotyping

It is just a little bit frustrating because the g2gtools -> alntools -> emase-zero pipeline requires 3 phases of manual removal of records:

All HLA contigs from the first input VCF and FASTA
All * alleles from the VCI
All ENST's that are _PAR_Y and the original ENST (these throw an error in the emase-zero step) from the aligned SAM and the gene2transcript map.

This would be no problem if it was mentioned in a manual or something. It just takes a long time to figure out if you dont know before hand.

mattjvincent commented 3 months ago

Interesting. The reason it fails is because g2gtools only uses the following bases/characters (upper or lower can be used):

ATGCYRSWKMBDHVN

I understand that * are stop codons. Might have to rethink how to handle this.

jon4thin commented 3 months ago

The * that appear in .g.vcfs and vcfs denote the genotype at a SNP site when there is a upstream spanning deletion. These appear when there is a SNP somewhere but there is also, on the other allele or another patient, a deletion spanning that reason. Therefore, * is used to call the genotype at the SNP location. In this way, you can call a "genotype" at that position while referencing the fact that there is an upstream deletion. Therefore if you are constructing a FASTA sequence, you want to leave out the * records because that information is already contained in the upstream deletion.

There should be no stop codons denoted as * in g.vcf or vcf or fasta because these are DNA, not protein, sequences.

churchill-lab / g2gtools

Extract Error: "Sequence contains non-DNA character '*' at position 393" #30