cfe-lab / proviral

0 stars 0 forks source link

Errors in HIVSeqinR #12

Closed CBeelen closed 7 months ago

CBeelen commented 1 year ago

Since we've started automatically launching the proviral pipeline as part of MiCall v7.15, we've seen two errors where HIvSeqinR crashed:

The first was sample HIV3202Y4604SGA-PLT4K11 from run 220610_M01841. For this one, we did some digging into the R code. The error message is the following: Error in .Call2("solve_user_SEW", refwidths, start, end, width, translate.negative.coord, : solving row 1: 'allow.nonnarrowing' is FALSE and the supplied end (8508) is > refwidth Calls: DNAStringSet ... make_IRanges_from_windows_args -> solveUserSEW -> .Call2

This is happening in line 1369 of the script modified.R (which is a copy of R_HIVSeqinR_Combined_ver09_ScrambleFix.R with modified paths), for the region HXB2_rev-exon2_nuc_8379-8650: Temp_REVexon2GP41_NUC <- DNAStringSet(MyWorkingAln_Nuc_DNAStringSet,start=min(MyIndex_ATCG),end=max(MyIndex_ATCG)+60) The start is 7785 and end is 8509. But the width of the two objects in MyWorkingAln_Nuc_DNAStringSet is just 8448 - so my guess is that the DNAStringSet tries to narrow the sequence to a smaller range, but the sequence is already shorter, and that may be why it's unhappy.

The other sample is 2929-P26-Y02944-9F from run 210618_M05995. The error message is: Error in .normargSEW(start, "start") : 'start' must be a vector of integers Calls: subseq ... make_IRanges_from_windows_args -> solveUserSEW -> .normargSEW

It looks like the problem is occurring in the same function again (solveUserSEW), but this time it's called from line 728 in the code, in the AUTOTRIM section.

CBeelen commented 1 year ago

Leaving this here for future reference. Adding this to modified.R creates a stack trace from R, which is a lot more helpful than the simple error message cited above:

options(error = function() {
  sink(stderr())
  on.exit(sink(NULL))
  traceback(3, max.lines = 1L)
  if (!interactive()) {
    q(status = 1)
  }
})

From this helpful blog

CBeelen commented 1 year ago

Guin very graciously helped us track down these errors, and these are the root causes:

For the first sequence, rev exon 2 aligns very weirdly to hxb2, with a large deletion - so the alignment introduces a large gap. This leads to the index not being calculated right and the proposed rev sequence being way too long - therefore, the script throws an error. To fix this, it is possible to re-run HIVSeqinR with a larger gap opening penalty in the aligner (Muscle) - setting it to -8000 instead of the usual -5000 solves this weird alignment. Guin does not recommend setting it to -8000 overall, though - in her experience, -5000 delivers the best results overall. Instead, we might be able to improve the code in the future, to re-try an alignment with different alignment parameters when it fails, to give up on non-critical regions, and to continue execution of the other regions even if one fails.

For the second sequence, the synthetic primer does not align properly to the sequence (the AUTOTRIM section detects and removes the primer). In case of a weird alignment, this section fails. In our case, it looks like a chunk of the primer sequence got duplicated. I looked at the sequence before the "real" primer got removed and the "synthetic" primer got added, before the HIVSeqinR analysis - it looks like this code is doing something weird in this case. From the intermediate files (conseq_primer_analysis and contig_primer_analysis), it looks like the primer was correctly detected, but something must have gone wrong when actually trimming it from the sample sequence: the no-primer sequence still has a bit of the real primer at the start, which the synthetic primer then gets added to.

Donaim commented 7 months ago

As we are switching to HIVIntact, these crashes are no longer relevant.

There is no similar issue for HIVIntact (it does not crash).