RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License
55 stars 8 forks source link

Unable to encode very short reference sequences. #33

Closed jarrettrios closed 4 weeks ago

jarrettrios commented 1 month ago

Hello!

I've been trying to Align against some small reference sequences but run into the error "maximum subunit ID in nongapped J table: 127; in gapped J table: 1". As far as I can tell, this is due to the nongapped encoding requiring 84 consecutive NT's without any N's. When I pad the reads with ACTGs such that there is 84 NT's on either side of the string of N's, it works. Is there a way to circumvent this without padding the reference sequences?

It's very possible what I need out of an aligner doesn't mesh with Arioc, but the potential speed up was too much to ignore!

Edit: Changed name of file to avoid confusion reference_templates.fa

>mutant
ccaacgaagaagaaattaaaactactaacccggtagcaacggagtcctatggacaagtggccacaaaccaccagnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnaccggttgggttcaaaaccaaggaatacttccgggtatggtttggcaggacagagatg
>wt
ccaacgaagaagaaattaaaactactaacccggtagcaacggagtcctatggacaagtggccacaaaccaccagnnnnnnnnnnnnnnnnnnnnnaccggttgggttcaaaaccaaggaatacttccgggtatggtttggcaggacagagatg

Encode_gapped.cfg

<AriocE gpuMask="0x00000001" seed="hsi18_0_30" maxJ="*">
  <dataIn sequenceType="R" srcId="0" filePath="/mnt/ramdisk/kraken/experiments/Experiment">
    <file SN="(*)" subId="1">reference_templates.fa</file>
  </dataIn>
  <dataOut>
    <path>/mnt/ramdisk/kraken/experiments/Experiment/enc/ref</path>
  </dataOut>
</AriocE>

Encode_nongapped.cfg

<AriocE gpuMask="0x00000001" seed="ssi84_2_30" maxJ="*">
  <dataIn sequenceType="R" srcId="0" filePath="/mnt/ramdisk/kraken/experiments/Experiment">
    <file SN="(*)" subId="1">reference_templates.fa</file>
  </dataIn>
  <dataOut>
    <path>/mnt/ramdisk/kraken/experiments/Experiment/enc/ref</path>
  </dataOut>
</AriocE>

Encode_Reads.cfg

<AriocE gpuMask="0x00000001">
<dataIn sequenceType="Q" srcId="1" QNAME="(*.*) */*">
<file subId="2">/mnt/ramdisk/kraken/experiments/Experiment/demux/sample_S_01_S1_R1_001.fastq</file>
</dataIn>
  <dataOut>
    <path>/mnt/ramdisk/kraken/experiments/Experiment/enc/reads</path>
  </dataOut>
</AriocE>

Align_upaired.cfg

<AriocU gpuMask="0x00000001" batchSize="256k">
  <R>/mnt/ramdisk/kraken/experiments/Experiment/enc/ref</R>
  <nongapped seed="ssi84_2_30" maxJ="1" />
  <gapped seed="hsi32_0_30" maxJ="*" Wmxgs="4_2_5_3" Vt="L, 0.6,0.6" AtO="99999999" />

  <Q filePath="/mnt/ramdisk/kraken/experiments/Experiment/enc/reads">
    <unpaired srcId="1" subId="2">
      <file>Sample_S_01_S1_R1_001</file>
    </unpaired>
  </Q>

  <A overwrite="true" mapqVersion="3" maxAperRead="1">
    <sam report="m">/mnt/ramdisk/kraken/experiments/Experiment/bam</sam>
  </A>

</AriocU>

Align log output:

Running arioc with command: AriocU /mnt/ramdisk/kraken/experiments/Experiment/align_reads.cfg
Copyright (c) 2015-2023 Johns Hopkins University.  All rights reserved.
 compiled             : 2024-05-29T01:43:11 (GNU g++ v11.4.0)
 data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
 computer name        : 2b1c291e4294
 operating system     : Ubuntu 22.04.4 LTS
 executable file      : /Arioc.x.151/src/AriocU
 configuration file   : /mnt/ramdisk/kraken/experiments/Experiment/align_reads.cfg
AriocBase::loadR: load R from disk...
AriocBase::loadR: loaded 1 R sequence: 19144 bytes (0.0GB) from 2 files in 0ms (infMB/s)
AriocBase::loadHJ: load H and J LUTs from disk...
CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 16
CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 12 bytes, ... ) on thread 17
CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 5368709120 bytes, ... ) on thread 18
CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
CudaPinnedPtr<T>::Realloc: cudaHostAlloc( ..., 1412 bytes, ... ) on thread 19
CudaPinnedPtr<T>::Realloc: back from cudaHostAlloc()
ApplicationException ([0x00000007] AriocBase/AriocBase.cpp 749): maximum subunit ID in nongapped J table: 127; in gapped J table: 1
AriocU ends (1)

Non Gapped Encoding log output

Running arioc with command: AriocE /mnt/ramdisk/kraken/experiments/Experiment/encode_nongapped_reference.cfg
AriocE v1.51.3140.23137 (release) +2023-05-17T01:06:49
Copyright (c) 2015-2023 Johns Hopkins University.  All rights reserved.
 compiled             : 2024-05-29T01:37:38 (GNU g++ v11.4.0)
 data type sizes      : int=4 long=8 *=8 Jvalue5=5 Jvalue8=8 JtableHeader=5 (little-endian)
 computer name        : 706d38edd182
 operating system     : Ubuntu 22.04.4 LTS
 executable file      : /Arioc.x.151/src/AriocE
 configuration file   : /mnt/ramdisk/kraken/experiments/Experiment/encode_nongapped_reference.cfg
AriocE::encodeR: encoding 1 file (48 CPU threads available)...
AriocE::encodeR: encoded 1 file
AriocE::sortCgpu: C list sort on GPU is unsupported.  Sorting on CPU instead.
AriocE::sortCcpu: sorting C list (1073741824 hash keys)...
AriocE::sortCcpu: sorted C list
AriocE::computeJlistSizes: m_nJ=0 (0+0) m_iCshort=0 m_iCzero=0
AriocE::computeJlistSizes: log2(nJ)  # J-lists
AriocE::computeJlistSizes: completed: J table uses 0 elements for 0 J values ( -nan% for list counts in J lists)
AriocE::buildJlists: building J lists for 1 file (2 sequences) (48 CPU threads available)...
AriocE::buildJlists: built J lists for 1 files (2 sequences)
AriocE::sortJgpu: sorting J lists on GPU...
AriocE::sortJgpu: J list sort complete
AriocE::countBBJ: counting big-bucket lists...
AriocE::countBBJ: counted 0 big-bucket lists
AriocE::validateHJ: start
AriocE::validateHJ: H table uses 0/1073741824 (0.000%) hash keys
AriocE::validateHJ: J table contains 0 J values (list counts in lists: 0 ( -nan%))
AriocE::validateHJ: maximum nJ=0 for hash key 0x00000000
AriocE::validateHJ: completed
AriocE::validateSubIds: start
AriocE::validateSubIds: validated 1073741824 hash values
AriocE::validateSubIds:  # subIds  # hash keys  avg J-list size
AriocE::validateSubIds:  subId Jf Jrc
AriocE::validateSubIds: completed
AriocE::finalizeJ: finalizing J lists on 48 worker threads...
AriocE::finalizeJ: completed
AriocE ends (0)
RWilton commented 1 month ago

Thank you for the detailed documentation of the problem with encoding short reference sequences.

The explanation for the 84-base minimum is simple: the nongapped seeding algorithm uses an 84-bit spaced seed. It would be possible to retool Arioc to allow some flexibility in the spaced seed, but that would require a lot of testing and validation before it could be used.

But that limitation should not apply in the situation where you aggregate your short reference sequences in a single FASTA file. The encoded binary representation puts padding between the sequences (analogous to your experiment with adding Ns to short reference base sequences) and handles both sequence naming (extracted from FASTA definition lines) and sequence topology (linear or circular).

From the AriocE log, it looks like there were over a billion seeds computed -- but none of them made it into the hash table! I cannot see the reason for this from the log, but I suspect this FASTA file embodies an edge case that's never been tested, namely, a set of reference sequences all of which are too short to be seeded with 84-bit seeds. If so, AriocE needs to handle this more gracefully than simply failing with -nan% errors!

May I ask: what is the distribution of sequence length in the sequences in the FASTA file?

jarrettrios commented 1 month ago

Edit: Thank you for your assistance ahead of time!

The sequences in this case are all 150 bp reads from an Illumina nextseq 1000. My particular use case requires to align 400+ million reads that are all exactly the same in most regions, with an area of variation (That's the regions with N's) that we are trying to extract. We then count how many times we see a sequence to conclude an enrichment score.

On that note as an aside, I noticed that the output says that Arioc discards duplicate reads, would that mean that by default Arioc would not be suited for the task? I was thinking if thats the case I could just edit the source files locally to just skip that part!

We do have reference sequences that can go up to 1000 bp (rarely) and reads that can go up to 500 bp, but I just happened to choose this test case first because it was pretty simple.

RWilton commented 1 month ago

Hi, Jarrett -- My question was, what is the distribution of sequence length in the sequences in the FASTA file, that is, the distribution of sequence length of the reference sequences in reference_templates.fa?

You've specified a wildcard SN (SN="(*)") in the AriocE config, which implies that there are multiple reference sequences in the input FASTA file, each of which has a unique name specified in its defline. This should permit you to use reference sequences shorter than 84 bases, although those reference sequences can obviously only get seeded for the gapped aligner. For what it's worth, however, AriocE should handle the two reference sequences you provided because they're both 150+ bases long. So something's fishy!

I can't yet tell from what I can see so far, but something is clearly going haywire when AriocE tries to encode your reference sequences for nongapped alignment. If nothing else, I want to fix AriocE's handling of this situation so that emits an informative error message and not just garble!

If you prefer, share your input FASTA file somewhere, or send it to me via some file transfer service, and I'll put AriocE under a debugger and see where things are going sideways.

As for the question about "duplicate reads": I'm not sure what you're referring to. Certainly the Arioc aligners (AriocU and AriocP) do not check for duplicate reads. There are dedicated software tools for that. Could you please tell me more specifically what you're referring to?

Thanks!

jarrettrios commented 1 month ago

Hi, Jarrett -- My question was, what is the distribution of sequence length in the sequences in the FASTA file, that is, the distribution of sequence length of the reference sequences in reference_templates.fa?

You've specified a wildcard SN (SN="(*)") in the AriocE config, which implies that there are multiple reference sequences in the input FASTA file, each of which has a unique name specified in its defline. This should permit you to use reference sequences shorter than 84 bases, although those reference sequences can obviously only get seeded for the gapped aligner. For what it's worth, however, AriocE should handle the two reference sequences you provided because they're both 150+ bases long. So something's fishy!

Oh! I'm sorry, I named the top "file" incorrectly, instead of reference_sequences.fa is should be reference_templates.fa . Those two (Mutant and WT) are the actual reference sequences I tried to encode. I really was only using two in this instance. I'll write them below again for quick reference.

>mutant
ccaacgaagaagaaattaaaactactaacccggtagcaacggagtcctatggacaagtggccacaaaccaccagnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnaccggttgggttcaaaaccaaggaatacttccgggtatggtttggcaggacagagatg
>wt
ccaacgaagaagaaattaaaactactaacccggtagcaacggagtcctatggacaagtggccacaaaccaccagnnnnnnnnnnnnnnnnnnnnnaccggttgggttcaaaaccaaggaatacttccgggtatggtttggcaggacagagatg

I did notice that it was specifically that repeated "N"s that caused the issue with the nongapped encoding. If I make sure that there are 84 non-degenerate bases on either side of the N's it works. If I take the N's away it works. But if there are less than 84 non-degenerate bases on both sides of the N's, then we run into the issue.

To get it to run (I wanted to test the speed regardless of this issue) I simply added padding non-degenerate bases to the reference sequences to get around this. Though this understandably caused issues with alignment so I figured I'd have to open an issue here to dig deeper.


As for the question about "duplicate reads": I'm not sure what you're referring to. Certainly the Arioc aligners (AriocU and AriocP) do not check for duplicate reads. There are dedicated software tools for that. Could you please tell me more specifically what you're referring to?

I'm referring to the line in the output of for alignment that says "duplicate mappings (unreported)". I think I was imprecise in my words when I said "duplicate reads" instead of "duplicate mappings".

----------------------------

  SAM output:
   reads                          : 20836093 (20836093 SAM records)
   mapped reads                   : 20582397 (20582397 SAM records) (98.78%)
    with 1 mapping                :   455131 (  455131 SAM records)
    with 2 or more mappings       : 20127266 (20127266 SAM records)
   unmapped reads                 :   253696 (  253696 SAM records)
   **duplicate mappings (unreported):        1**
RWilton commented 1 month ago

I think I see the problem.

Given a reference sequence where there is no contiguous set of adjacent non-N bases that is at least as long as the nongapped seed length, AriocE is (correctly) reporting that no seeds were found but (incorrectly) failing to emit a usable empty list of seed locations for that reference sequence. (As you have observed, reads can still be aligned to such sequences using gapped alignment seeds, which are much shorter.)

I will dig into this and fix it.

As for the duplicate mappings: this is a small piece of information that can be helpful in performance tuning. When the aligner (AriocU or AriocP) is configured to report secondary mappings (i.e., the AtO= parameter is 2 or greater), it may discover the same mapping(s) for any given read more than once (through both nongapped and gapped alignment, or through aligning at multiple seed locations in parallel).

The aligner does not report the same mapping twice for the same read, but it does count them. When the aligner counts a lot of these duplicate mappings, you might increase speed without affecting sensitivity by decreasing AtO= so the aligner can report fewer mappings per read and not waste time looking for secondary mappings that aren't there. (See the "Tuning" sections of the Arioc user guide for more context.)

jarrettrios commented 1 month ago

Ah! Thank you for the clarification on the duplicated mappings. I will continue on my integration of Arioc and let you know of any other edge cases I run into!

RWilton commented 1 month ago

I have created an updated Arioc v1.52 as a preliminary release. It contains a fix for the problem you reported.

jarrettrios commented 4 weeks ago

I can confirm I am able to encode my short references without padding now! That's awesome thanks!

RWilton commented 4 weeks ago

That is good news. Thank you for reporting the problem and giving us a chance to resolve it.