Weird output from multiple replicates isoform mode

DepledgeLab / DRUMMER

DRUMMER: Detection of RNA modifications in nanopore direct RNA Sequencing datasets

GNU General Public License v3.0

21 stars 1 forks source link

Weird output from multiple replicates isoform mode #15

Closed FDallaPozza closed 2 years ago

FDallaPozza commented 2 years ago

Hi, I tried to use the multiple replicates isoform mode to perform an analysis, using the same command provided in the documentation, I got a weird output where it seems some columns are missing with respect to how the output should be. This is the output:

In particular, for what I need from the output, "transcriptID" and "position" seems to be referred only to the genome, and "strand" is missing. Am I doing something wrong? Best Regards. Fabio

DepledgeLab commented 2 years ago

This is very curious. Can you paste the full command used?

FDallaPozza commented 2 years ago

This is the code:

demo_yeast.fa is the genome, chromosomes.txt the list of the chromosomes and all the other the correct files

DepledgeLab commented 2 years ago

As I understand it, you have aligned your reads against a genome containing multiple chromosomes and you are then running DRUMMER in isoform mode where the 'isoforms' are the chromosome IDs?

This will give the output you observe with 'transcripID' reflecting the chromosome number and 'position' referring to the position within the chromosome.

Because you are running the analysis in isoform mode, the internal assumption is that the only alignments retained/used are those against the top strand. If you wanted similar information for the reverse strand, you would need to supply an alignment against a reverse-complemented genome file.

Given you are working with yeast, would it not be simpler to work with a transcriptome-level alignment instead of a genome-level alignment?

We'll look into adding the ability to run 'exome' level analyses against multiple chromosomes but this will take a little time to implement.

FDallaPozza commented 2 years ago

No I'm running DRUMMER in exome mode (-a exome), the code above was an example and inside chromosomes.txt there was only one chromosome. What I did then was to implement a for loop using another chromosomes.txt file (this time containing all the chromosomes) and so the code was repeated for each chromosome inside the file. Btw the first output I showed was referred only to one chromosome, as specified in the code pasted above

JonAbebe commented 2 years ago

Really interesting question. I suspect when iterating through your text file, the bam files (-c, -t arguments) are always kept constant. The issue with this is that DRUMMER uses this information for downstream directory names, which means that previous outputs can be overridden as you progress through the loop.

Could you tell me what is contained in one of the complete_analysis folders? If there are DRUMMER outputs for each chromosome, we might be able to directly call summary.py and multiple_combine.py using these files.

However, I do think the advice of DepledgeLab is best. Since we designed DRUMMER initially for viruses with 1 "chromosome", exome mode may not be best in this case. The best course of action would be treating these different chromosomes as their own transcript and running this analysis in isoform mode.

FDallaPozza commented 2 years ago

I set the output folder with the -o option, so in every iteration the output is a folder with the name of the chromosome, like this:

For example inside the folder XVI/ :

The file multiple_comp.txt is the one I pasted in the first message, inside the folder inimap.sortG.1.1-inimap.sortG.2.1 I got :

And this shows some lines of XVI.complete.txt file inside the complete_analysis folder:

JonAbebe commented 2 years ago

So I looked through some old data that we ran in exome mode. Looks like the output you are getting is correct.

We split exome mode data into forward/reverse during the initial alignment, so when running DRUMMER you will have 2 results (fwd/rev) for each chromosome.
The transcript-ID column is not going to be present because we designed this for viral genomes, which means each position in exome mode could refer to multiple transcripts (due to overlapping transcriptional units), which would be less than helpful for our data.

However, for your data you could (1) Separate your bam files into fwd/rev and by using information from the resulting data (chromosome, position, strand) you can map to a transcript by using a GTF file.

FDallaPozza commented 2 years ago

Ok thank you for the help, so I think I'll run it in the isoform mode and then perform a lift-over to obtain all the genomic coordinates. Best, Fabio