NKI-GCF / XenofilteR

Filtering of PDX samples for mouse derived reads
GNU General Public License v3.0

long vectors not supported yet #7

Open sztup opened 5 years ago

sztup commented 5 years ago

Hello,

I get the following error message:

Cigar.matrix <- cigarOpTable(Human[[1]]$cigar)
Error in .Call2("cigar_op_table", cigar, PACKAGE = "GenomicAlignments") :
  long vectors not supported yet: memory.c:3486

I guess I have too many reads:

length(Human[[1]]$cigar)
[1] 354266489

The Cigar.matrix has 9 rows, and 354266489 * 9 is more than 2^31. I thought this shouldn't lead to an error, because I am using R >= 3.5 on a 64-bit system. Is it possible to change some setting, or would it make sense to split the BAM files? If so, what would you recommend?
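The arithmetic in the question can be checked with a few lines of shell. This is a rough sketch that assumes the cap in play is R's ordinary (non-long) vector limit of 2^31 - 1 elements and that the matrix holds nine values per read, as described above. The function names `cells_needed` and `max_reads` are made up for illustration.

```shell
# Limit for an ordinary (non-long) R vector: 2^31 - 1 elements.
max_len=$(( (1 << 31) - 1 ))

# Cells in a matrix holding a fixed number of values per read.
# $1 = number of reads, $2 = values per read (defaults to 9).
cells_needed() {
    echo $(( $1 * ${2:-9} ))
}

# Largest read count whose matrix still fits under the limit.
# $1 = values per read (defaults to 9).
max_reads() {
    echo $(( max_len / ${1:-9} ))
}

echo "limit:        $max_len"
echo "cells needed: $(cells_needed 354266489)"
echo "max reads:    $(max_reads)"
```

Under these assumptions the matrix would need 3188398401 cells against a limit of 2147483647, so roughly 238 million reads is the most a single call could hold.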

Thanks,

Used RAM: 400GB

R version 3.5.0 (2018-04-23) -- "Joy in Playing"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

SessionInfo(): ...

other attached packages:
 [1] XenofilteR_1.6       Rsamtools_1.34.1     Biostrings_2.50.2
 [4] XVector_0.22.0       GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
 [7] IRanges_2.16.0       S4Vectors_0.20.1     BiocGenerics_0.28.0
[10] BiocParallel_1.16.6  RLinuxModules_0.2

loaded via a namespace (and not attached):
 [1] zlibbioc_1.28.0             GenomicAlignments_1.18.1
 [3] lattice_0.20-38             tools_3.5.0
 [5] SummarizedExperiment_1.12.0 grid_3.5.0
 [7] Biobase_2.42.0              lambda.r_1.2.3
 [9] futile.logger_1.4.3         matrixStats_0.54.0
[11] Matrix_1.2-17               GenomeInfoDbData_1.2.0
[13] formatR_1.7                 futile.options_1.0.1
[15] bitops_1.0-6                RCurl_1.95-4.11
[17] DelayedArray_0.8.0          compiler_3.5.0

ApurvaG05 commented 4 years ago

Hi,

I have the same issue while running XenofilteR. Is there a fix for this? Have you identified a solution and got it to run successfully?

Thanks in advance. -Apurva

PeeperLab2 commented 4 years ago

Dear Apurva,

Unfortunately, the limit originates in the Rsamtools package. The Bioconductor team is working on a new version of Rsamtools to solve this issue, but it is not available yet and I do not know when it will be. One solution would be to split your FASTQ files into smaller FASTQ files, map each to mouse and human, and run XenofilteR on each. After that you can merge the BAM files again. It is not an ideal situation, I know.
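The splitting step of this workaround could look roughly like the sketch below. It assumes standard four-line FASTQ records and a POSIX `split`; the function name `split_fastq` and the chunk prefix are made up for illustration, and the mapping, XenofilteR, and merging steps are not shown.

```shell
# split_fastq FILE READS_PER_CHUNK PREFIX
# Writes PREFIXaaa, PREFIXaab, ... each holding at most READS_PER_CHUNK reads.
split_fastq() {
    fq=$1
    reads=$2
    prefix=$3
    # Decompress if needed, then cut on a multiple of 4 lines so that
    # no FASTQ record is ever split across two chunks.
    case $fq in
        *.gz) gzip -dc "$fq" ;;
        *)    cat "$fq" ;;
    esac | split -l $(( reads * 4 )) -a 3 - "$prefix"
}

# Example (paths are placeholders):
#   split_fastq sample_R1.fq.gz 100000000 sample_R1.chunk_
#   split_fastq sample_R2.fq.gz 100000000 sample_R2.chunk_
```

For paired-end data, using the same reads-per-chunk value for both mates keeps chunk N of read 1 aligned with chunk N of read 2.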

Another option would be to use the Perl implementation from Roel Kluin: https://github.com/PeeperLab/XenofilteR/tree/original/original

Without the Rsamtools support for long vectors I do not see an easy solution in the XenofilteR package. I’ll try to dig into the error once more, see if I can come up with another (easier) solution.

Best, Oscar

pidoc commented 4 years ago

Hi,

Any update on this matter?

Thanks, Johann

PeeperLab2 commented 4 years ago

Hi,

I am searching for a way to run XenofilteR on BAM files that exceed the Rsamtools limit, but unfortunately I do not yet have an implementation that works well. As soon as I have something I will post it on the GitHub page.

Kind regards, Oscar

imcoleman commented 4 years ago

Just curious whether this has been fixed with R version 4.0? I have some very large bam files to filter and am getting this error (which never came up previously with smaller files.) Thanks, Ilsa

PeeperLab2 commented 4 years ago

Dear Ilsa,

I am sorry, but R version 4.0 and the new Bioconductor release did not fix this problem. The problem exists because Rsamtools can only hold a limited number of sequence reads in memory (a large number, but not enough for very large BAM files). I am working on a new implementation of XenofilteR that will solve this problem and limit memory usage.

However, this might take some time to implement. I will announce a new version on the Github page once it is finished.

Kind regards, Oscar

imcoleman commented 4 years ago

Thanks for the update!

Akazhiel commented 3 years ago

Any updates on the implementation of long vectors support?

wrongbong commented 3 years ago

I am facing the same problem too. I have 2 large BAM files with 140 M and 340 M reads respectively which are giving me the error. Any updates on the new release of XenofilteR? Thanks,

sun8841 commented 3 years ago

I'm facing the same issue. Are there any updates on this long vector issue? Thanks.

RoelKluin commented 2 years ago

This is not entirely a XenofilteR issue. One way to circumvent the issue is to use the R version that works, maybe via conda? https://anaconda.org/nki-avl/xenofilter

npatel-ah commented 2 years ago

Hello, I have huge WGS samples (~200GB), and splitting and rerunning would not be ideal because I am starting from the BAM file.

So I opted for the Perl script. Is there any downside to using it? It seems to be quite an old version. When I first ran it I got a syntax error, which I fixed by changing line 137

From: } elsif (($balance->[0] < 0) && ($balance->[1] < 0){
To:   } elsif (($balance->[0] < 0) && ($balance->[1] < 0)) {

I am also wondering what is different in the conda version that would make it work. I see that the conda version includes R 3.5.

Thanks

RoelKluin commented 2 years ago

The Perl script was my original version, but Perl also stores the reads in memory, which is an issue for huge BAM files. Oscar wrote the R version, which was tested and published. I later wrote the Rust version to run with low memory requirements. It depends on the order of the alignments (pairs), which should be FASTQ order, that is, the raw alignment output before any sorting, for both graft and host. This version is not as well tested. I have local changes at the NKI, developed for a particular use case, that may not be working yet. The online version does seem to build, so possibly it just works.

Maybe it's wise, if you want to try it, to test on 1M reads first:

zcat $fq1 | head -n 4000000 | gzip --fast > 1.fq.gz

and similarly for read 2 if paired-end, just to test whether the output is OK. It seems I already added the --filtered-reads option, which you can use to write the filtered reads and see clearly in IGV whether the graft reads are correctly excluded. I may also do some tests next week, if I have time.
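The subsetting command in this reply generalizes to any number of reads, since each FASTQ record is four lines. The sketch below wraps it in two small helpers; the names `first_reads` and `count_reads` are made up for illustration, and gzipped input is assumed.

```shell
# first_reads FILE.fq.gz N
# Prints the first N reads (4 lines each) of a gzipped FASTQ, uncompressed.
first_reads() {
    gzip -dc "$1" | head -n $(( $2 * 4 ))
}

# count_reads
# Counts the reads in uncompressed FASTQ data arriving on stdin.
count_reads() {
    echo $(( $(wc -l) / 4 ))
}

# Example (paths are placeholders):
#   first_reads sample_R1.fq.gz 1000000 | gzip --fast > 1.fq.gz
#   first_reads sample_R2.fq.gz 1000000 | gzip --fast > 2.fq.gz
#   gzip -dc 1.fq.gz | count_reads
```

Counting the reads in both subsets afterwards is a cheap way to confirm that the two mates were truncated at the same record.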

If you already have coordinate-sorted host and graft alignments, you can also name-sort those. However, samtools collate is not enough: both alignments need to have the same record order. Also, both alignments should contain all records, including unmapped reads.
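Whether two alignments share the same record order can be checked before running the filter. The sketch below is one hypothetical way to do that on SAM text: it compares the read-name column (QNAME, the first field) of both files line by line. The function name `same_read_order` is made up for illustration, and the check is stricter than strictly necessary, since extra secondary or supplementary records in one file would also make it fail.

```shell
# same_read_order GRAFT.sam HOST.sam
# Returns 0 if both SAM files list the same read names in the same order.
same_read_order() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    # Skip header lines (they start with '@', which a read name cannot)
    # and keep only the first tab-separated column, the read name.
    grep -v '^@' "$1" | cut -f 1 > "$a"
    grep -v '^@' "$2" | cut -f 1 > "$b"
    cmp -s "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}

# Example (paths are placeholders):
#   same_read_order graft.sam host.sam && echo "record order matches"
```

For BAM input, the records would first have to be written out as SAM text, for example with `samtools view`.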

The Rust version is included in the rust branch on GitHub. Go to the subdirectory xenofilters/ and make sure you have rustup set up. Then run

cargo build --release

The binary is target/release/xenofilter, but you may want to run bwa_mem.sh.

I added the bwa_mem.sh script for the ordering. I just added one change to write the filtered host output as well, but you may want to increase the memory for sorting. I believe hisat2 should work similarly, but it would require changes to the alignment commands.

Hope this just works, otherwise let me know, Roel

npatel-ah commented 2 years ago

Hello Roel,

Thank you for the thorough information. I understand that Perl is memory-intensive and thus won't solve the issue; when I tested the Perl script there was another error, likely due to memory.

I only have a human-mapped BAM, so what I ended up doing was converting the BAM to FASTQ, splitting the files, and then running the R version of XenofilteR on the split FASTQ files, which worked for now.

Appreciate all your help.