OliveiraDS-hub / ChimeraTE

A pipeline to detect chimeric transcripts derived from genes and transposable elements.
GNU General Public License v3.0
18 stars 4 forks source link

Mode_2 trouble #10

Closed JBerthelier closed 6 months ago

JBerthelier commented 9 months ago

Dear Daniel, Congrat for you nice work!

I am interested to use the mode_2 of ChimeraTE, but the script unfortunatly get trouble at the assembly step. Do you have an idea of the problem?

I followed you installation tuto, converted the transcriptome assembly into accepted format and used this command line:

python3 chimTE_mode2.py --input "input_mode2.tsv" --project projectA --te "TEseq.fa" --ref_TEs arabidopsis --transcripts "transcriptome.fa" --strand rf-stranded --threads 110 --assembly --ram 500

Thanks!

Below is the error message:

... [Tuesday 19/9/2023 - 12h:23] Converting alignment to bed... Done! [Tuesday 19/9/2023 - 12h:27] Masking transcripts with "TE" database... Done! [Tuesday 19/9/2023 - 12h:29] Identifying chimeric reads... Done! Traceback (most recent call last): File "chimTE_mode2.py", line 195, in transcriptome_assembly() File "scripts/mode2_assembly.py", line 99, in transcriptome_assembly overlapped[['read','mate']] = overlapped.read_IDs.str.split("/",expand=True) File "/home/j/jeremy-berthelier/miniconda/envs/chimeraTE/lib/python3.6/site-packages/pandas/core/frame.py", line 3041, in setitem self._setitem_array(key, value) File "/home/j/jeremy-berthelier/miniconda/envs/chimeraTE/lib/python3.6/site-packages/pandas/core/frame.py", line 3067, in _setitem_array raise ValueError("Columns must be same length as key") ValueError: Columns must be same length as key srun: error: deigo011112: task 0: Exited with exit code 1

JBerthelier commented 9 months ago

Not sure if it can help. but I used the file Arabidopsis_thaliana.TAIR10.cdna.all.fa from ESEMBL that I converted with your util bash script.

looks ok I guess:

" >AT1G76820.1_AT1G76820 TTAGAGTTGAGAAATTAGATTAAAAAAAGGATTAGATTAGTGATGAAGTAAGAGTTTTTG CTCTCTCTCAAACGGCTCGCTTTGATTACTCACGTAAGCCGGAAAGAATTTGGAACCGGC " ...

Best,

OliveiraDS-hub commented 9 months ago

Dear Jérémy, it will be a pleasure to help you.

Looks like you have empty files that are not supposed to be empty.

I'm a bit confused about your command-line. This second step in your log file "Masking transcripts with "TE" database..." only works when you activate the --assemblyoption, but you didn't.

Let's treat you error as if you are doing the transcriptome assembly with --assemblyoption.

It looks like you have provided a custom TE library in fasta format with --ref_TEs because you have the message "Masking transcripts with "TE" database...". It makes sense to do that, because by default you would mask Drosophila consensus in your arabidopsis transcripts.

Is "TE" a fasta file with TE consensus and their respective RepeatMasker ID? like >copia#LTR/Copia

If so, could you please make sure that this file has .fa extension? Like TE.fa

Additionally, could you check if you have two files at projects/projectA/rep1/trinity_out Trinity.fasta.out and trinity_TEs.bed

If they exist, are they empty?

Last question: Is there no > at the beginning of your transcript IDs, such as "AT1G76820.1_AT1G76820"?

Best

JBerthelier commented 9 months ago

Dear Daniel,

Thanks for your prompt reply,

-Yes I used the --assembly option (it is hide at the end of the command)

-Yes, after double checking I used the option '--ref_TEs arabidopsis' (I correct my first message), I do not provide any custome consensus, I am supposing it gonna run RepeatMasker with its own A. thaliana TE db, am I correct?

-Yes, my bad the transcript IDs has the '>' (I correct my previous message)

-The trinity_TEs.bed is embty and the Trinity.fasta.out has just two detections which seems no TE seq:


SW perc perc perc query position in query matching repeat position in repeat score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID

232 20.3 5.1 0.0 TRINITY_DN8492_c0_g1_i1 280 338 (1266) C IS3 ARTEFACT (1161) 97 36 1 232 20.3 5.1 0.0 TRINITY_DN8492_c0_g1_i2 280 338 (1283) C IS3 ARTEFACT (1161) 97 36 2


Could it mean no chimeric transcripts detected in this dataset ? I am using 3 replicates.

I also tested with your exemple data of drosophila, but at the final steps I am getting also an error message unfortunatly (Please see bellow), I also do not have the three expected final files:

The error message with the exemple_data:


[Thursday 21/9/2023 - 09h:07] Converting alignment to bed... Done! [Thursday 21/9/2023 - 09h:07] Masking transcripts with "flies" database... Done! [Thursday 21/9/2023 - 09h:08] Identifying chimeric reads... Done! [Thursday 21/9/2023 - 09h:08] Performing blast to identify transcripts... [Thursday 21/9/2023 - 09h:08] Done! [Thursday 21/9/2023 - 09h:08] Recovering the best matches... [Thursday 21/9/2023 - 09h:08] Done! Traceback (most recent call last): File "chimTE_mode2.py", line 175, in out_group = create_dir(str(out_dir + '/' + group)) TypeError: must be str, not float


I am running ChimeraTE on a cluster with slurm.

Let me know if I can help by providing any useful additional info

Thanks again for your kind support.

Best

OliveiraDS-hub commented 9 months ago

Dear Jeremy,

I'm not an Arabidopsis guy, but I've just found out that the default Dfam database does not have arabidopsis consensus on its "curated" version.

When you install ChimeraTE by conda, and you activate the env, it will have the RepeatMasker 4.1.2, with Dfam 3.3 database. Then, if you check for the consensus of thaliana, surprisingly there are only nine "artefact" consensus:

$path/miniconda3/envs/chimeraTE/share/RepeatMasker/famdb.py -i $path/miniconda3/envs/chimeraTE/share/RepeatMasker/Libraries/RepeatMaskerLib.h5 families --include-class-in-name -f fasta_name -ad 'arabidopsis' | grep '>'

>IS1#ARTEFACT @root [S:10]
>IS10#ARTEFACT @root [S:10]
>IS150#ARTEFACT @root [S:10]
>IS186#ARTEFACT @root [S:10]
>IS2#ARTEFACT @root [S:10]
>IS3#ARTEFACT @root [S:10]
>IS30#ARTEFACT @root [S:10]
>IS5#ARTEFACT @root [S:10]
>Tn1000#ARTEFACT @root [S:10]

So, what is happening is that RepeatMasker is masking your trinity transcripts based on these nine TE consensus, and your .out file is empty of interspersed repeats.

In this special case for arabidopsis, you have two options:

1- You can simply use a fasta file with consensus. Take a look on this database: athaTEref_RU.lib. It's publicly available on this website from the team who developed REPET https://urgi.versailles.inra.fr/Data/Transposable-elements/Arabidopsis

If you decide to use it, please just change the file name from athaTEref_RU.lib to athaTEref_RU.fa. It's silly but it might save your time

2- You can download the complete Dfam database (curated and uncurated consensus) and substitute it in the RepeatMasker folder installed in your ChimeraTE conda env. This can be a pain because the file has around 800Gb, and depending on the version you need to reinstall famdb too.

So, answering directly your question: "Could it mean no chimeric transcripts detected in this dataset ? I am using 3 replicates." No, it means the masking of your transcripts is not properly done.

Even so, I have to add a conditional statement to print in the stdout when it happens. The same way that it happened with arabidopsis, it might happen with some many other Dfam taxa. I'm going to do this enhancement asap. Thank you!


About the error with the test data, this is completely unexpected. Could you please open a new issue? Can you confirm if the example data for Mode 1, and Mode 2 without --assemblyhave worked properly?

I hope to hear from you soon!

Best

OliveiraDS-hub commented 6 months ago

I'm closing this issue.

Feel free to reopen it if you still have any question.

Best