TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

Pipeline dies at building database #152

Closed kguynes closed 2 weeks ago

kguynes commented 3 weeks ago

Hi,

Many thanks for creating a wonderful tool. Unfortunately, I have tried several times to run the pipeline on the genomic FASTA file I downloaded from https://parasite.wormbase.org/Caenorhabditis_angaria_prjna51225/Info/Index, without success.

I'll copy the error messages here for clarity:

    <<< Cleaning Genome >>>
Traceback (most recent call last):
  File "/usr/local/share/earlgrey-5.0.0-0/scripts//faswap.py", line 11, in <module>
    a=dict(csv.reader(open(dictionary),delimiter="\t"))
  File "/usr/local/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 3: invalid start byte

              )  (
         (   ) )
         ) ( (
       _______)_
    .-'---------|  
       ( C|/\/\/\/\/|
    '-./\/\/\/\/|
     '_________'
      '-------'
    <<< Detecting Novel Repeats >>>
Building database cAngaria:
  Reading /users/xx/00-Reference_genomes/00-C.angaria/caenorhabditis_angaria.PRJNA51225.WBPS19.genomic.fa.gz.prep...
Died at /usr/local/bin/BuildDatabase line 331.
The makeblastdb program exited with code 1.  Please check your input file(s) for potential formating errors.
/usr/local/bin/makeblastdb returned: 

To test whether the FASTA file itself is the problem, I ran RepeatModeler directly to build the database and compute the TE annotation, and it seems to run without an issue.

Not sure how to mitigate the issue I'm running into with the EarlGrey tool. Any help/pointers would be greatly appreciated.

Thank you very much in advance!

TobyBaril commented 2 weeks ago

Hi,

This looks like an issue with the FASTA headers that faswap.py can't deal with. In this particular case, it seems some non-UTF-8 characters are being detected. What do the headers look like? Are there any weird characters in them?
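
One way to check for such characters is to scan the header lines as raw bytes. The snippet below is only a rough sketch and is not part of Earl Grey: the helper names `open_maybe_gzip` and `find_suspect_headers` are made up for illustration, and it assumes that any non-ASCII byte in a header is worth a closer look.

```python
import gzip


def open_maybe_gzip(path):
    """Open a FASTA file in binary mode; files ending in .gz are decompressed."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def find_suspect_headers(path):
    """Return (line_number, raw_header) pairs for headers containing non-ASCII bytes.

    Hypothetical helper for illustration only. Lines are read as bytes, so an
    undecodable byte such as 0x94 is reported instead of raising an error.
    """
    suspects = []
    with open_maybe_gzip(path) as handle:
        for line_number, raw in enumerate(handle, start=1):
            if raw.startswith(b">") and not raw.isascii():
                suspects.append((line_number, raw.rstrip(b"\r\n")))
    return suspects
```

Calling `find_suspect_headers("genome.fa")` (the file name here is a placeholder) returns an empty list when every header is plain ASCII. In that case the headers are probably not where the undecodable byte comes from.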

kguynes commented 2 weeks ago

Dear @TobyBaril,

Thank you for your response. I've copied an example of the header below. Could it be the space and the length parameter included in the header?

>Cang_2012_03_13_00002 length=637461
TAATATTGAATTATAGCTGTCCACGTATTTATGGCGCACCCTGTAATGTGGTGATTGCAA
TACCAAATGTTGAAAATACAGTATGATAGATTTGAAATGTCAAATCCATTGTAATTAGAA
GATTGAGAATGATAGATTGTGGGAAGAGCTTCCCCTTATTTAATATGTTATGATTTATGT
TCAAAAACAAAAAGTACAAAACTACAAAACGATATGATGTTCAGCCTTTTTTCCACCAAA
AATATATAATGCAAAAATTATAGCGGTATCAAGCTTAAGGTATTTGAATAAAAATCAGAA
CTGGTTGTGAAAAGTACTATTTCTGAGGGTGAGTACAAGAATCAAACAAGTCTATTTTTC
TGTAATTCTCTGGAGATTGTTAATTTAGATGCAAAATATTATTCTAAAAGAATTTCATTT

kguynes commented 2 weeks ago

As per your suggestion, I have amended the FASTA headers and the pipeline now seems to run just fine. I will report back if anything else goes wrong. Many thanks for your help!
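
For anyone who lands here with the same error, below is a minimal sketch of one way to simplify headers before running the pipeline. The thread does not say exactly how the headers were amended, so this is an assumption, not the fix that was used: it keeps only the first whitespace-delimited token (dropping fields such as `length=...`) and replaces anything outside `A-Z a-z 0-9 _ . -` with an underscore. The function names `simplify_header` and `rewrite_fasta` are made up for illustration, and the input is assumed to be an uncompressed FASTA file.

```python
import re


def simplify_header(header):
    """Reduce a FASTA header to '>' plus a sanitised first token.

    Hypothetical helper for illustration only.
    """
    tokens = header.lstrip(">").split()
    name = tokens[0] if tokens else "unnamed"
    # Replace every character outside a conservative safe set.
    return ">" + re.sub(r"[^A-Za-z0-9_.-]", "_", name)


def rewrite_fasta(src, dst):
    """Copy src to dst, simplifying each header and leaving sequence lines as they are."""
    # errors="replace" turns undecodable bytes into U+FFFD rather than raising;
    # in headers, that character is then replaced by an underscore.
    with open(src, encoding="utf-8", errors="replace") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                line = simplify_header(line)
            fout.write(line + "\n")
```

With the header posted above, `simplify_header(">Cang_2012_03_13_00002 length=637461")` gives `>Cang_2012_03_13_00002`. Note that the sketch does not check that the shortened names are still unique, so that is worth verifying separately.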