LabTranslationalArchitectomics / riboWaltz

optimization of ribosome P-site positioning in ribosome profiling data
MIT License
42 stars 10 forks source link

temporary offset (local maximum) mis-identification #79

Open iraiosub opened 2 months ago

iraiosub commented 2 months ago

hey! thanks for developing this fantastic tool! I've been using it with great results and included it in a number of pipelines, with plans to include it the the nf-core/riboseq pipeline as a P-site identification+diagnostics option. While working with some test data (smaller size than any typical ribo-seq dataset), I noticed in cases where there are missing read extremities at certain positions relative to the start codon, results in mis-identifying what should be the correct temporary offset. E.g. visually the temporary offset clearly should be at the -12 position as seen in the offset plot below (also cross-validated by other tools such as ribotish), but is assigned by the psite function as -17.

28.pdf

I have tracked the issue and identified it originates in the tempoff function, which introduces NAs if there are no read ends at the adjacent positions, here -13 and -11 for the offset of -12 which is the most frequent in the data, causing the -12 to be dropped and no longer a valid offset option.

This occurs with both versions 1.2.0 and 2.0. I am attaching the files needed to reproduce the issue.

test_data.zip

I will shortly follow with a PR with a proposed solution, if you would like to review it and make adjustments as needed.

I appreciate this might not be an issue with most full-sized datasets, but it would be good to solve this and make a patch release here and on bioconda, as this is required for using it as part of reproducible workflows - the community would benefit greatly from it :)

iraiosub commented 2 months ago

oh, actually I do not have access to contribute, so here's a proposed quick fix - to be used inside of the psite function:

tempoff <- function(v_dist){

  ttable <- sort(table(v_dist), decreasing = T)

  # Create right-shifted table with zero filling for non-existing positions
  right_indices <- as.character(as.numeric(names(ttable)) + 1)
  ttable_sr <- ttable[right_indices]
  ttable_sr[is.na(ttable_sr)] <- 0  # Replace NA with 0

  # Create left-shifted table with zero filling for non-existing positions
  left_indices <- as.character(as.numeric(names(ttable)) - 1)
  ttable_sl <- ttable[left_indices]
  ttable_sl[is.na(ttable_sl)] <- 0  # Replace NA with 0
  tsel <- rowSums(cbind(ttable > ttable_sr, ttable > ttable_sl), na.rm = T)

  return(as.numeric(names(tsel[tsel == 2][1])))
  }
fabiolauria commented 2 months ago

Ho there, thank you for using riboWaltz. You are perfectly right, in some specific cases the tool doesn't work. Lately several issues related to low input data popped out, which makes sense since riboWaltz has been designed for standard RiboSeq with "enough" material.

Thank you so much for bothering to inspect the code and propose an alternative. I will implemented the solution you kindly provided as soon as I'm free from other, urgent duties.

Best Fabio

iraiosub commented 2 months ago

thank you very much Fabio! That's really good to know.

Cheers, Ira