OliveiraDS-hub / ChimeraTE

A pipeline to detect chimeric transcripts derived from genes and transposable elements.
GNU General Public License v3.0

ChimeraTE Mode 1 with bam files #6

Closed OliveiraDS-hub closed 1 year ago

OliveiraDS-hub commented 1 year ago

Originally posted by @Lynuxoo in https://github.com/OliveiraDS-hub/ChimeraTE/issues/5#issuecomment-1506478704

"Thank you so much for your previous response, I was able to run the program successfully.

However, I have encountered a new issue. I noticed that ChimeraTE requires a genome file and an input file, but I already have STAR alignment results. Is there a way to use my existing alignment results as input for ChimeraTE instead of re-aligning with STAR?

Thanks for any suggestions or pointers to relevant documentation."

OliveiraDS-hub commented 1 year ago

ChimeraTE Mode 1 does not have a direct option for this, but since STAR alignment is one of the first steps of the pipeline, you can adapt the script to skip it. To avoid large modifications to the script, follow these instructions:

You have to provide a three-column table with the --input parameter. The fastq files (or whatever you put in the first two columns) will not be used, but the third column (sample name) is really important. The name that you use in the third column (rep1, rep2 in the example data) will be used to create a folder within the --project folder. Then, within each sample folder, you must have an "alignment" folder, like this:

ChimeraTE
└── projects
    └── --project name
        ├── rep1
        │   └── alignment
        └── rep2
            └── alignment

Check the projects folder after running the example data. Then, after creating these folders, move your bam files from STAR into the "alignment" folders, like this:

ChimeraTE
└── projects
    └── --project name
        ├── rep1
        │   └── alignment
        │       └── rep1_Aligned.sortedByCoord.out.bam
        └── rep2
            └── alignment
                └── rep2_Aligned.sortedByCoord.out.bam

Note that all files must end with "_Aligned.sortedByCoord.out.bam" (the STAR default), and they must begin with the sample name (third column of the input table).
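The layout and naming rule above can be sketched in Python. This is only a minimal illustration, not part of ChimeraTE: the function name `stage_bam` and its arguments are hypothetical, and it assumes the project folder lives under `projects/` inside the ChimeraTE directory, as in the tree shown above.

```python
from pathlib import Path
import shutil

# File name ending that the thread says ChimeraTE expects (the STAR default).
STAR_SUFFIX = "_Aligned.sortedByCoord.out.bam"


def stage_bam(chimerate_dir, project, sample, bam_path):
    """Copy an existing BAM into <chimerate_dir>/projects/<project>/<sample>/alignment/.

    `sample` must match the third column of the --input table.
    Hypothetical helper; ChimeraTE itself does not provide it.
    """
    aln_dir = Path(chimerate_dir) / "projects" / project / sample / "alignment"
    aln_dir.mkdir(parents=True, exist_ok=True)
    # Rename on copy so the file starts with the sample name and ends with the STAR suffix.
    target = aln_dir / (sample + STAR_SUFFIX)
    shutil.copy2(str(bam_path), str(target))
    return target
```

For example, `stage_bam("ChimeraTE", "my_project", "rep1", "star_out/sampleA.bam")` (paths made up for illustration) would create `ChimeraTE/projects/my_project/rep1/alignment/rep1_Aligned.sortedByCoord.out.bam`. Copying rather than moving keeps the original STAR output untouched; a symlink may be preferable for large files.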

Finally, you just need to insert # at the beginning of line 137 of chimTE_mode1.py and of lines 18 to 21 of mode1_alignment.py.
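If you prefer not to edit the scripts by hand, commenting out a line range can be sketched as below. The helper names are hypothetical, and the line numbers quoted in this thread may not match other versions of the scripts, so check the lines before changing them.

```python
def comment_out(lines, first, last):
    """Return a copy of `lines` with '#' prepended to lines first..last (1-indexed, inclusive)."""
    if first < 1 or last > len(lines) or first > last:
        raise ValueError("line range outside the file")
    return [
        "#" + line if first <= number <= last else line
        for number, line in enumerate(lines, start=1)
    ]


def comment_out_in_file(path, first, last):
    """Rewrite `path` in place, keeping a .bak copy of the original next to it."""
    with open(path) as handle:
        original = handle.readlines()
    with open(str(path) + ".bak", "w") as backup:
        backup.writelines(original)
    with open(path, "w") as handle:
        handle.writelines(comment_out(original, first, last))
```

With the numbers given above, the calls would be `comment_out_in_file("chimTE_mode1.py", 137, 137)` and `comment_out_in_file("mode1_alignment.py", 18, 21)`. The .bak copy makes it easy to restore a line later.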

Be sure to provide gtf files that correspond to the same fasta file you aligned your reads to.

Lynuxoo commented 1 year ago

Thank you so much for patiently answering my questions!

Following your guidance, I placed the STAR alignment results into the appropriate folder. However, during the execution, I encountered an error message:

[main_samview] fail to read the header from "*_Aligned.sortedByCoord.out.bam".
samtools index: "accepted_hits.bam" is in a format that cannot be usefully indexed
[main_samview] fail to read the header from "accepted_hits.bam".
samtools index: "fwd1_f.bam" is in a format that cannot be usefully indexed
[main_samview] fail to read the header from "accepted_hits.bam".
samtools index: "fwd2_f.bam" is in a format that cannot be usefully indexed
[W::hts_set_opt] Cannot change block size for this format
samtools merge: failed to read header from "fwd1_f.bam"
[E::hts_open_format] Failed to open file "fwd.bam" : No such file or directory
samtools index: failed to open "fwd.bam": No such file or directory
[main_samview] fail to read the header from "accepted_hits.bam".
samtools index: "rev1_r.bam" is in a format that cannot be usefully indexed
[main_samview] fail to read the header from "accepted_hits.bam".
samtools index: "rev2_r.bam" is in a format that cannot be usefully indexed
[W::hts_set_opt] Cannot change block size for this format
samtools merge: failed to read header from "rev1_r.bam"
[E::hts_open_format] Failed to open file "rev.bam" : No such file or directory
samtools index: failed to open "rev.bam": No such file or directory
[E::hts_open_format_impl] Failed to open file fwd.bam
Failed to open BAM file fwd.bam
[E::hts_open_format_impl] Failed to open file rev.bam
Failed to open BAM file rev.bam

I suspect that this issue may be related to my use of raw output from STAR (with the parameter "--outSAMtype BAM Unsorted"). Do I need to run samtools with specific parameters to address this?

Thank you in advance for your patient response again.

OliveiraDS-hub commented 1 year ago

Dear @Lynuxoo

Indeed, your bam file was generated with different parameters from the ones ChimeraTE uses: we run STAR with --outSAMtype BAM SortedByCoordinate. The first error happens because samtools is trying to index a bam file that is not sorted.

You can sort them with samtools before running ChimeraTE:

samtools sort in.bam > out_sorted.bam

Remember to follow the file naming rule: rep1_Aligned.sortedByCoord.out.bam, where "rep1" is the value in the third column of your --input file.
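The check-then-sort step can be sketched as follows. This is an illustrative wrapper, not part of ChimeraTE: the function names are hypothetical, it assumes samtools is on the PATH, and it assumes a coordinate-sorted BAM declares SO:coordinate on the @HD line of its header.

```python
import subprocess

# File name ending required by the naming rule above.
STAR_SUFFIX = "_Aligned.sortedByCoord.out.bam"


def is_coordinate_sorted(header_text):
    """True if the @HD line of a SAM/BAM header declares SO:coordinate."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            return "SO:coordinate" in line.split("\t")
    return False  # no @HD line: treat the file as unsorted


def sort_command(in_bam, sample, threads=1):
    """Build a samtools sort call that writes <sample>_Aligned.sortedByCoord.out.bam."""
    return ["samtools", "sort", "-@", str(threads),
            "-o", sample + STAR_SUFFIX, str(in_bam)]


def sort_if_needed(in_bam, sample, threads=1):
    """Sort `in_bam` unless its header already says it is coordinate-sorted."""
    header = subprocess.run(
        ["samtools", "view", "-H", str(in_bam)],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    ).stdout
    if is_coordinate_sorted(header):
        return False  # already sorted; it still has to be renamed to the expected name
    subprocess.run(sort_command(in_bam, sample, threads), check=True)
    return True
```

Writing the output with `-o` under the final name avoids a separate rename step; the resulting file then goes into the sample's "alignment" folder.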

Try it out with one bam file (input.tsv with only one line) and let me know if it's working.

Lynuxoo commented 1 year ago

Thank you again for your helpful guidance. Following your suggestion, I used SAMTOOLS to process my BAM files and successfully completed the first stage of the analysis after following the instructions provided.

However, I'm currently encountering an error while attempting to perform Gene Expression Analysis.

Traceback (most recent call last):
  File "ChimeraTE/chimTE_mode1.py", line 158, in <module>
    alignment_func(out_dir,group,aln_dir,mate1,mate2)
  File "ChimeraTE/mode1_alignment.py", line 84, in alignment_func
    TEfile = pybedtools.BedTool(str(tmp + '/TE_file.bed'))
  File "/miniconda3/envs/chimeraTE/lib/python3.6/site-packages/pybedtools/bedtool.py", line 528, in __init__
    raise FileNotFoundError(msg)
FileNotFoundError: File "ChimeraTE/projects/tmp/TE_file.bed" does not exist

I have confirmed that the working directory and input files are correct, and I suspect that there may be some other issues that I have overlooked.

OliveiraDS-hub commented 1 year ago

@Lynuxoo I'm sorry, I thought you had kept the files created by the pipeline during your first error. The TE_file.bed is missing from your temporary project folder. To solve it, remove the # from line 137 of chimTE_mode1.py. Then open the file mode1_prep_data.py and comment out lines 35-40. This second step is not required, but you will save time by not creating a STAR index for your genome.

Let me know if it worked. Let's solve this issue together.

Lynuxoo commented 1 year ago

Thank you very much for your patient response! I apologize for the delay in providing feedback. I followed your suggestions from our last conversation, but unfortunately, I still encountered some error messages. The main challenge I faced was the excessively long runtime of the code, which made the debugging process extremely difficult. Specifically, I'm working with human brain RNA-seq data, and the processing time for a single sample exceeds 120 hours. Considering that I have over a hundred samples, I'm uncertain whether this extended processing time is expected for ChimeraTE when handling human transcriptome data, or if it's simply due to my improper usage.

OliveiraDS-hub commented 1 year ago

Dear @Lynuxoo

I'm sorry for the very delayed answer. I was waiting for the answer from the journal before updating you. ChimeraTE has finally been accepted.

Unfortunately, ChimeraTE Mode 2 relies on Trinity assembly, which greatly increases the processing time. It depends a lot on your question, but many chimeras can be found without using the --assembly option. If you really need the transcripts, you can reduce the runtime by increasing the number of threads and the amount of RAM (--threads and --ram, respectively), but in any case performing transcriptome assembly for each RNA-seq replicate will be time-consuming.

I'm going to close this issue, since the latest discussion is no longer about the original subject.