Open jon4thin opened 3 months ago
You seem to be using an older version. Python 2.7 and g2gtools 0.2.9. Can you update to Python 3 and g2gtools 2.0.0 and retry?
@mattjvincent thank you so much for your reply. I have been using the docker image: kbchoi/emase:latest
, I see that this has not been updated.....
Installing from the main branch will install 2.0.0. I just created a release. Feel free to create a Docker image of your own.
There is one in the main branch in the top level directory.
@jon4thin ...
Try:
docker run churchilllab/g2gtools:2.0.0 g2gtools --help
You will have to mount directories and such, but this should get you started.
I will give it a shot right now and update shortly! I have never made a Docker image before so I will have to see how far i can get. tysvm for getting me going!
Is the /opt/conda/envs/emase/bin/emase-zero
command in kbchoi/emase:latest
also outdated? I noticed that the flags in the documentation online and what I can do with the tool from the docker image is slightly different.
sorry for the delaychurchilllab/g2gtools:2.0.0
works and it works on the Velsera CWL-based cloud compute platform as well! Let me know if there is any updated documentation (i.e. change in the command flags) and let me know if kbchoi/emase:latest also outdated! Thanks so much for your support <3~
EMASE's codebase was folded into GBRS. However, emase-zero has remain unchanged. There have been some minor flag changes since upgrading gbrs to Python 3, but we will have to compile a list.
Hey @mattjvincent , sorry to bother again. I updated my pipeline to use churchilllab/g2gtools:2.0.0
and while I received no issues with the *
in the patch
or transform
or even gtf2db
steps, g2g extract
returned an error:
ValueError: Sequence contains non-DNA character '*' at position 393
I have the full error if needed.
Unless I am missing something, I had a conversation with a GATK person, and it seems like one can just grep/awk remove these records anyway:
Remove the *
ALT Allele Reporting From a g.vcf Before (or After) Genotyping
It is just a little bit frustrating because the g2gtools -> alntools -> emase-zero pipeline requires 3 phases of manual removal of records:
*
alleles from the VCIThis would be no problem if it was mentioned in a manual or something. It just takes a long time to figure out if you dont know before hand.
Interesting. The reason it fails is because g2gtools only uses the following bases/characters (upper or lower can be used):
ATGCYRSWKMBDHVN
I understand that *
are stop codons. Might have to rethink how to handle this.
The *
that appear in .g.vcfs and vcfs denote the genotype at a SNP site when there is a upstream spanning deletion. These appear when there is a SNP somewhere but there is also, on the other allele or another patient, a deletion spanning that reason. Therefore, *
is used to call the genotype at the SNP location. In this way, you can call a "genotype" at that position while referencing the fact that there is an upstream deletion. Therefore if you are constructing a FASTA sequence, you want to leave out the *
records because that information is already contained in the upstream deletion.
There should be no stop codons denoted as *
in g.vcf or vcf or fasta because these are DNA, not protein, sequences.
I am trying to run
g2gtools extract
with a g2gtools DB created from a GTF and a FASTA. It is running successfully but after 14 minutes it stops:2024-08-05T19:38:02.180415541Z GTTCCAATACTGTGTTGCAGTGGGAGCCCAAACTTTCCCCAGTGTGAGTGCTCCCAGCAAGAAAGTGGCAAAGCAGATGGCCGCAGAGGAAGCCATGAAGGCCCTGCATGGGGAGGCGACCAACTCCATGGCTTCTGATAACCAG 2024-08-05T19:38:02.180441323Z [g2gtools debug] Exon ID=ENSE00003837056.1_L;Exon: ENSE00003837056.1_L chr1_L:154596815-154597005 (-1) #6 2024-08-05T19:38:02.180595316Z [g2gtools debug] chr1_L:154596815-154597005 (Length: 191) 2024-08-05T19:38:02.180600730Z CCTGAAGGTATGATCTCAGAGTCACTTGATAACTTGGAATCCATGATGCCCAACAAGGTCAGGAAGATTGGCGAGCTCGTGAGATACCTGAACACCAACCCTGTGGGTGGCCTTTTGGAGTACGCCCGCTCCCATGGCTTTGCTGCTGAATTCAAGTTGGTCGACCAGTCCGGACCTCCTCACGAGCCCAA 2024-08-05T19:38:02.180627776Z [g2gtools debug] Exon ID=ENSE00003836678.2_L;Exon: ENSE00003836678.2_L chr1_L:154589767-154590419 (-1) #7 2024-08-05T19:38:02.183394207Z Traceback (most recent call last): 2024-08-05T19:38:02.185504398Z File "/opt/conda/envs/g2gtools/bin/g2gtools", line 4, in <module> 2024-08-05T19:38:02.186001359Z __import__('pkg_resources').run_script('g2gtools==0.2.9', 'g2gtools') 2024-08-05T19:38:02.186006933Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script 2024-08-05T19:38:02.187122669Z self.require(requires)[0].run_script(script_name, ns) 2024-08-05T19:38:02.187127761Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script 2024-08-05T19:38:02.187499526Z exec(code, namespace, namespace) 2024-08-05T19:38:02.187942044Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/EGG-INFO/scripts/g2gtools", line 132, in <module> 2024-08-05T19:38:02.187946940Z G2GToolsApp() 2024-08-05T19:38:02.187949734Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/EGG-INFO/scripts/g2gtools", line 99, in __init__ 2024-08-05T19:38:02.187994284Z getattr(self, args.command)() 2024-08-05T19:38:02.187999699Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/EGG-INFO/scripts/g2gtools", line 105, in extract 2024-08-05T19:38:02.188002813Z g2gtools.g2g_commands.command_fasta_extract(sys.argv[2:], self.script_name + ' extract') 2024-08-05T19:38:02.188005779Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/g2gtools/g2g_commands.py", line 422, in command_fasta_extract 2024-08-05T19:38:02.189323301Z fasta.fasta_extract_transcripts(fasta_file, args.database, None, raw=args.raw) 2024-08-05T19:38:02.189330035Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/g2gtools/fasta.py", line 562, in fasta_extract_transcripts 2024-08-05T19:38:02.190316627Z partial_seq_str = str(g2g_utils.reverse_complement_sequence(partial_seq)) 2024-08-05T19:38:02.190322077Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/g2gtools/g2g_utils.py", line 229, in reverse_complement_sequence 2024-08-05T19:38:02.190975659Z return reverse_sequence(complement_sequence(sequence)) 2024-08-05T19:38:02.190981606Z File "/opt/conda/envs/g2gtools/lib/python2.7/site-packages/g2gtools-0.2.9-py2.7.egg/g2gtools/g2g_utils.py", line 214, in complement_sequence 2024-08-05T19:38:02.191010963Z raise ValueError("Sequence contains non-DNA character '{0}' at position {1:n}\n".format(val[position], position + 1)) 2024-08-05T19:38:02.191575326Z ValueError: Sequence contains non-DNA character '*' at position 393
Is '*' supposed to denote a stop codon? but then again, I only have 35 in my entire FASTA....