Cant seem to reproduce article results in figure 4A

NicolasProvencher commented 7 months ago

Hi,

Thanks for your tool

Before running riboformer on all my data i tried reproducing the result you gave in the article as a way to help me identify a treshold to use as orf/protein evidence from riboseq So far I have done everything like the article except for the xRNA filtering since you told me in a previous issue to forgo it because the model would do it by normalizing.

Now since in your article you looses 13 to 90% of your read by doing the filtering of xRNA this might explain the difference but i have 2 question.

By comparing my results to yours, 2 thing becomes apparent. First, I have more biotypes (cds variant, n-terminal extension and truncation and other) than you do. Since I'm using exactly the orf_biotype in the csv output i was wondering if it was due to the unfiltered on xRNA or if when you did your figure you merged certain biotype into others or if you ignored the biotypes i have in extra.

Second, The amount of lncRNA-ORF for my analysis is way lower in my results than in yours. I was wondering if it was because most of the lncRNA detection from your results disappeared due to the normalization thus disappearing when you omit to filter xRNAs because the amount of noise raises.

If the solution end up being I need to filter for xRNA, can you give me more details on how you did it, (how was the star genome for xRNA was generated)

I tried getting the .csv file from your article by running riboformer --results on the .npy shared on zenodo but it seems to break here (line 201 of processing.py)

if has_ribo_output: tr_ids = np.array([o[0].split(b"|")[1] for o in ribo])

due to a index out of bound error (the split doesnt split anything because the first element looks like this [b 'tr_id'] and there is no "|" anywhere)

Genome and annotation used is ensembl v107 star command used is the same as in the preprint supplementary treshold used to make the biotype graph is 0.6 like in the article

File	Read count after align	CDS count
SRR1802129	1.93E+07	18,627
SRR2433794	2.66E+07	32,230
SRR2732970	2.57E+08	5,038
SRR2733100	3.01E+08	5,620
SRR2954800	4.35E+06	6,284
SRR8449577	5.87E+07	2,884
SRR9113067	1.11E+07	457
SRR11005875	2.48E+07	935

jdcla commented 7 months ago

Hey Nicolas,

Thank you for your continued interest and feedback. To answer your questions:

By comparing my results to yours, 2 thing becomes apparent. First, I have more biotypes (cds variant, n-terminal extension and truncation and other) than you do. Since I'm using exactly the orf_biotype in the csv output i was wondering if it was due to the unfiltered on xRNA or if when you did your figure you merged certain biotype into others or if you ignored the biotypes i have in extra.
Second, The amount of lncRNA-ORF for my analysis is way lower in my results than in yours. I was wondering if it was because most of the lncRNA detection from your results disappeared due to the normalization thus disappearing when you omit to filter xRNAs because the amount of noise raises.

Unfortunately, the reported biotypes from the preprint article contain several misclassifications that have been fixed in later releases of RIBO-former. For example, there is a large number of lncRNAs that are now reported as CDS variants. This is the case for transcript isoforms that share an ORF where either the TIS or TTS overlaps with the genomic position of a canonical CDS. There is also a large number of intORFs that were incorrectly labeled as such and were actually N-terminal extentions/truncations.

I tried getting the .csv file from your article by running riboformer --results on the .npy shared on zenodo but it seems to break here (line 201 of processing.py)

Since the release of the pre-print, the tool has continued to evolve. Outputs are now formated as {ribo_id}|{transcript id} rather than just {transcript id}. Its possible that simply altering all id's in the npy files to include a single random prefix (e.g., foobar|{transcript_id}), which should hopefully work. Nonetheless, the best way to evaluate parity with reported results is to evaluate the ROC AUC/PR AUC. Both the new and old results should have both the predictions and true labels, so it should be straightforward to just concatenate all predictions/true values of all transcripts into single vectors to calculate the scores.

{results of the table}

The number of called CDSs from both figures do not seem to agree. E.g., from the rank progression, it would seem SRR2733970 would have the highest CDS count.

I have not investigated myself what the result is of omitting the xRNA filtering step, but as noted in a previous conversation, I do not expect this to affect results much. It will largely affect the computing time to map and parse the reads into hdf5 format. Since RIBO-former uses aggregated data (i.e., read counts per position), this doesn't affect the size of the data during the fine-tuning step much. Additionally, reads are normalized per transcript so read count fluctuations across the genome won't have an effect either.

I appreciate your work looking into this and apologize for the inconsistencies with the current tool and pre-print paper. I am finishing up a manuscript right now that I am hoping to submit this month. A pre-print of this new manuscript should be available as well.

J

jdcla commented 7 months ago

To follow up on this, a healthy dataset with good read depth where results have been filtered down to only include *TG start codons looks similar to this following the new approach of orf typing:

Notice that this is on data not discussed in the current pre-print. the threshold to select a positive set is at 0.15.

NicolasProvencher commented 7 months ago

Thanks a lot this clears up most of the differences I had

TRISTAN-ORF / RiboTIE

Cant seem to reproduce article results in figure 4A #6