AlexanderBartholomaeus / smORFer


Connecting output of module B and C #1

Closed RickGelhausen closed 2 years ago

RickGelhausen commented 3 years ago

Hi,

I want to use your tool to analyze an E. coli dataset containing both classical RiboSEQ data and TIS data.

So far, I have been able to run all the modules independently as described in your documentation. In your publication, Figure 1 describes the workflow quite nicely and indicates that the results of Module B and C can be merged/used together to find new smORFs.

However, I do not understand, based on your documentation, how to combine the results. From what I understand, I would use the resulting "TIS_candidates.txt" to filter the results in the "RPF_translated.txt" (Module B step4) or "RPF_3nt_translated.txt" (Module B step5). Would this be the correct way to use the results?

Thank you in advance for your help.

AlexanderBartholomaeus commented 3 years ago

Hi Rick,

Both module B and module C use BED files as input. In general, this can be anything you like (as long as the chromosome and location information match the information in the BAM files).
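The matching requirement above can be checked before running either module. The following is a minimal sketch, not part of smORFer: the function names are hypothetical, and it assumes the reference sequence names have already been read from the BAM header (for example with `samtools view -H`) and are passed in as a plain set.

```python
def bed_chromosomes(bed_lines):
    """Collect the chromosome names (first column) used in a BED file."""
    names = set()
    for line in bed_lines:
        line = line.strip()
        # Skip empty lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        names.add(line.split("\t")[0])
    return names


def missing_chromosomes(bed_lines, bam_reference_names):
    """Return BED chromosome names that do not occur in the BAM header."""
    return sorted(bed_chromosomes(bed_lines) - set(bam_reference_names))


# Hypothetical example data: the second record uses a name the BAM does not know.
bed_example = [
    "NC_000913.3\t100\t160\tpORF_1\t0\t+",
    "chr\t200\t260\tpORF_2\t0\t-",
]
print(missing_chromosomes(bed_example, {"NC_000913.3"}))
```

An empty list means every BED record refers to a sequence that is present in the alignment file.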

You suggested taking the results from step 8 (TIS_candidates.txt) and using them in step 4 or step 5. I would go from general to specific. This means: module A is the most general and gives all possible ORFs; module B is more specific and gives ORFs with a ribosomal signature (although it is not really clear where the ribosome started); module C gives only ribosomes that are starting (in theory). Thus, I suggest following the numbered steps from low to high: I would take the results from step 4 or 5 and use them as input for steps 7+8.

Our results in the smORFer paper support that TIS data is the most specific. Unfortunately, it is only rarely available. I am very curious to see more data and more analyses of the same data.

Best, Alex

RickGelhausen commented 3 years ago

Hi Alex,

thank you for your quick reply. I see now that my issue was written in a confusing way.

I already ran all the modules as specified in the documentation, the question was about how to use the 3 output files together in the end, as shown in Figure 1 in the publication, to get a list of smORFs.

So what I wanted to ask was not whether I can use step 8 as an input for step 4+5 but rather whether I can use the step 8 results and overlap the start codons determined like this with the results from step 4 or step 5, to get a list of smORFs (with verified start and stop codon).

But if I understand you correctly, I can directly use the results of step 4+5 to run step 7+8, rather than using the pORFs from step 1. That should give me a list of valid start codons, which I can then use to collect a list of smORFs from the previous predictions.

Sorry, I was confused by Figure 1, as it looked to me like Module B and C should be run independently and then used together later.

I was able to observe the same regarding TIS data: it is a very valuable method to determine novel sORFs.

Cheers, Rick

AlexanderBartholomaeus commented 3 years ago

Hi Rick,

You are right: in Figure 1, the schema shows that all ORFs from module A are used for module B (steps 4+5) and module C (steps 7+8).

Would it help if I added a small helper script to intersect the BED files of full ORFs (as resulting from steps 4+5) with the BED files of start codons (as resulting from steps 7+8)?
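As an illustration of what such an intersection could look like, here is a minimal sketch in Python. It is not the helper script from the repository, and all function names are hypothetical. It assumes both inputs are BED6 records and that an ORF counts as confirmed when a start-codon record has the same chromosome, strand, and 5' coordinate (the BED start on the plus strand, the BED end on the minus strand).

```python
def parse_bed6(lines):
    """Parse BED6 lines into (chrom, start, end, name, strand) tuples."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip headers and incomplete lines
        chrom, start, end, name, _score, strand = fields[:6]
        records.append((chrom, int(start), int(end), name, strand))
    return records


def five_prime_key(record):
    """Key a record by chromosome, strand, and its strand-aware 5' coordinate."""
    chrom, start, end, _name, strand = record
    return (chrom, strand, start if strand == "+" else end)


def orfs_with_confirmed_start(orf_lines, start_lines):
    """Keep the ORFs whose 5' end coincides with a start-codon record."""
    starts = {five_prime_key(r) for r in parse_bed6(start_lines)}
    return [r for r in parse_bed6(orf_lines) if five_prime_key(r) in starts]


# Hypothetical example: two ORFs, one start codon on the minus strand.
orfs = [
    "chrA\t100\t160\torf_plus\t0\t+",
    "chrA\t300\t360\torf_minus\t0\t-",
]
start_codons = ["chrA\t357\t360\ttis_1\t0\t-"]
for record in orfs_with_confirmed_start(orfs, start_codons):
    print(record[3])
```

Keying on the 5' coordinate rather than on full interval overlap avoids reporting an ORF merely because a start codon of a different, nested ORF falls inside it.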

RickGelhausen commented 3 years ago

Hi Alex,

I think that would be very beneficial to me and future users of the tool. This would enable users to get a quick table of results to check.

In the meantime I had some time to analyze my results in more detail. I still think such an intersection would be nice to find strong candidate ORFs. But I noticed that this only makes sense if both RIBOseq and TISseq data are of similar quality.

In my case the RIBOseq data has quite low coverage, and the TIS data is quite good. This results in a loss of good candidates when only looking at the intersection. But from my experience, RIBOseq data is usually of equal or higher quality than the current TIS data.
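One possible way to avoid losing candidates when the two data sets differ in quality is to keep the union and record which evidence supports each ORF, instead of keeping only the intersection. The sketch below is an assumption for illustration, not something smORFer provides; the function name, the labels, and the ORF identifiers are hypothetical.

```python
def label_evidence(rpf_ids, tis_ids):
    """Label every candidate ORF by the data type(s) that support it."""
    rpf_ids, tis_ids = set(rpf_ids), set(tis_ids)
    labels = {}
    for orf_id in sorted(rpf_ids | tis_ids):
        if orf_id in rpf_ids and orf_id in tis_ids:
            labels[orf_id] = "both"
        elif orf_id in tis_ids:
            labels[orf_id] = "TIS_only"
        else:
            labels[orf_id] = "RPF_only"
    return labels


# Hypothetical identifiers from the module B and module C result lists.
labels = label_evidence(["orf_1", "orf_2"], ["orf_2", "orf_3"])
for orf_id, evidence in labels.items():
    print(orf_id, evidence)
```

Candidates labelled "both" correspond to the strict intersection, while the single-evidence labels keep the remaining candidates available for manual inspection.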

AlexanderBartholomaeus commented 2 years ago

Hi @RickGelhausen

I have finally added a new helper script called 'overlap_candidates.R' and updated the documentation. This script is supposed to work on the TIS candidates from module C and find their overlap with the results from module A or B. Perhaps you can find the time to confirm that it works.