most input reads mysteriously discarded from xenome output

fpbarthel commented 6 years ago

I am trying to use Xenome to filter xenograft mouse reads from graft human reads. I am running into the issue that xenome classify finishes in seconds and returns kilobyte size output files without any discernible error message. Most the the reads seem to have been discarded and I can't figure out where they have gone or why this may have happened.

Is there any reason you can think of why this may have happened? I am using xenome 1.0.0.

This is the command I ran:

xenome classify --tmp-dir /fastscratch/XXX -v -P index --pairs -i /XXXX/Sample_17-86718-3/17-86718-3_S7_L002_R1_001.fastq.gz -i /XXXX/Sample_17-86718-3/17-86718-3_S7_L002_R2_001.fastq.gz --output-filename-prefix /XXX/86718-3 --graft-name human --host-name mouse --log-file logfile > statsfile

This is the output to the log file:

Tue Oct 10 11:05:14 2017    info    opening buffer 0 /fastscratch/XXX/1507647914-50853-0-classbuf-0
Tue Oct 10 11:05:14 2017    info    performing 1 pass
Tue Oct 10 11:05:14 2017    info    writing to /XXXX/86718-3/86718-3_neither_1.fastq
Tue Oct 10 11:05:14 2017    info    writing to /XXXX/86718-3/86718-3_neither_2.fastq
Tue Oct 10 11:05:14 2017    info    writing to /XXXX/86718-3/86718-3_both_1.fastq
Tue Oct 10 11:05:14 2017    info    writing to /XXXX/86718-3/86718-3_both_2.fastq
Tue Oct 10 11:05:14 2017    info    writing to /XXXX/86718-3/86718-3_ambiguous_1.fastq
Tue Oct 10 11:05:14 2017    info    writing to /XXXX/86718-3/86718-3_ambiguous_2.fastq
Tue Oct 10 11:05:14 2017    info    writing to /XXXX/86718-3/86718-3_human_1.fastq
Tue Oct 10 11:05:14 2017    info    writing to /XXXX/86718-3/86718-3_human_2.fastq
Tue Oct 10 11:05:14 2017    info    writing to /XXXX/86718-3/86718-3_mouse_1.fastq
Tue Oct 10 11:05:14 2017    info    writing to /XXXX/86718-3/86718-3_mouse_2.fastq
Tue Oct 10 11:05:14 2017    info    pass 0
Tue Oct 10 11:05:18 2017    info    parsing sequences from /XXXX/Sample_17-86718-3/17-86718-3_S7_L002_R1_001.fastq.gz
Tue Oct 10 11:05:18 2017    info    parsing sequences from /XXXX/Sample_17-86718-3/17-86718-3_S7_L002_R2_001.fastq.gz
Tue Oct 10 11:09:18 2017    info    total elapsed time: 244.56612205505371
Tue Oct 10 11:09:18 2017    info    total elapsed time: 244.57243013381958

This is the stdout:

Statistics
B   G   H   M   count   percent class
0   0   0   0   250 85.034  "neither"
0   0   0   1   0   0   "both"
0   0   1   0   22  7.48299 "definitely mouse"
0   0   1   1   1   0.340136    "probably mouse"
0   1   0   0   15  5.10204 "definitely human"
0   1   0   1   2   0.680272    "probably human"
0   1   1   0   0   0   "ambiguous"
0   1   1   1   0   0   "ambiguous"
1   0   0   0   0   0   "both"
1   0   0   1   0   0   "probably both"
1   0   1   0   0   0   "definitely mouse"
1   0   1   1   1   0.340136    "probably mouse"
1   1   0   0   0   0   "definitely human"
1   1   0   1   1   0.340136    "probably human"
1   1   1   0   0   0   "ambiguous"
1   1   1   1   2   0.680272    "ambiguous"

Summary
count   percent class
18  6.12245 human
24  8.16327 mouse
0   0   both
250 85.034  neither
2   0.680272    ambiguous

The input fastq files are large:

total 6.6G
-rwxr-xr-- 1 XXX XXX 2.8G Oct  6 13:04 17-86718-3_S7_L002_R1_001.fastq.gz
-rwxr-xr-- 1 XXX XXX 3.1G Oct  6 13:03 17-86718-3_S7_L002_R2_001.fastq.gz

The output files are tiny:

-rw-r--r-- 1 XXX XXX  445 Oct 10 11:05 86718-3_ambiguous_1.fastq
-rw-r--r-- 1 XXX XXX  445 Oct 10 11:05 86718-3_ambiguous_2.fastq
-rw-r--r-- 1 XXX XXX    0 Oct 10 11:05 86718-3_both_1.fastq
-rw-r--r-- 1 XXX XXX    0 Oct 10 11:05 86718-3_both_2.fastq
-rw-r--r-- 1 XXX XXX 4.0K Oct 10 11:05 86718-3_human_1.fastq
-rw-r--r-- 1 XXX XXX 4.0K Oct 10 11:05 86718-3_human_2.fastq
-rw-r--r-- 1 XXX XXX 1.9K Oct 10 11:09 86718-3.log
-rw-r--r-- 1 XXX XXX 5.3K Oct 10 11:05 86718-3_mouse_1.fastq
-rw-r--r-- 1 XXX XXX 5.3K Oct 10 11:05 86718-3_mouse_2.fastq
-rw-r--r-- 1 XXX XXX  55K Oct 10 11:09 86718-3_neither_1.fastq
-rw-r--r-- 1 XXX XXX  55K Oct 10 11:09 86718-3_neither_2.fastq
-rw-r--r-- 1 XXX XXX  631 Oct 10 11:09 86718-3.stats.txt

The index took surprising little time to create, and also seems a bit small perhaps, but I don't know what it should be....

-rwxrwx--- 1 XXX XXX   24 Oct  9 14:25 index-both.header
-rwxrwx--- 1 XXX XXX  21M Oct  9 14:25 index-both.kmers-d0
-rwxrwx--- 1 XXX XXX 6.4M Oct  9 14:25 index-both.kmers-d1
-rwxrwx--- 1 XXX XXX   56 Oct  9 14:25 index-both.kmers.header
-rwxrwx--- 1 XXX XXX  32M Oct  9 14:25 index-both.kmers.high-bits
-rwxrwx--- 1 XXX XXX 371M Oct  9 14:25 index-both.kmers.low-bits.lwr
-rwxrwx--- 1 XXX XXX 186M Oct  9 14:25 index-both.kmers.low-bits.upr
-rwxrwx--- 1 XXX XXX  24M Oct  9 14:29 index-both.lhs-bits
-rwxrwx--- 1 XXX XXX  24M Oct  9 14:29 index-both.rhs-bits
-rwxrwx--- 1 XXX XXX   24 Oct  9 14:23 index-graft.header
-rwxrwx--- 1 XXX XXX 3.1M Oct  9 14:23 index-graft.kmers-d0
-rwxrwx--- 1 XXX XXX 4.5M Oct  9 14:23 index-graft.kmers-d1
-rwxrwx--- 1 XXX XXX   56 Oct  9 14:23 index-graft.kmers.header
-rwxrwx--- 1 XXX XXX  20M Oct  9 14:23 index-graft.kmers.high-bits
-rwxrwx--- 1 XXX XXX 187M Oct  9 14:23 index-graft.kmers.low-bits.lwr
-rwxrwx--- 1 XXX XXX  94M Oct  9 14:23 index-graft.kmers.low-bits.upr
-rwxrwx--- 1 XXX XXX   24 Oct  9 14:24 index-host.header
-rwxrwx--- 1 XXX XXX 2.4M Oct  9 14:24 index-host.kmers-d0
-rwxrwx--- 1 XXX XXX 4.9M Oct  9 14:24 index-host.kmers-d1
-rwxrwx--- 1 XXX XXX   56 Oct  9 14:24 index-host.kmers.header
-rwxrwx--- 1 XXX XXX  20M Oct  9 14:24 index-host.kmers.high-bits
-rwxrwx--- 1 XXX XXX 184M Oct  9 14:24 index-host.kmers.low-bits.lwr
-rwxrwx--- 1 XXX XXX  92M Oct  9 14:24 index-host.kmers.low-bits.upr
-rwxrwx--- 1 XXX XXX 6.9K Oct  9 14:29 xenome-index_stdout.txt

This is the command I used to create the index:

xenome index -M 24 -T 8 -P index -H Mus_musculus.GRCm38.dna_rm.primary_assembly.fa.gz -G Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa.gz

fpbarthel commented 6 years ago

I was able to track down a copy of xenome 1.0.1 via a co-worker and am having the same issue with that version.

Deguerre commented 6 years ago

Thank you! That helps a lot.

I've been looking for an incompatibility, but it sounds like it's actually a longstanding bug.

Everyone has probably worked out by now that you can make the bug go away, at the cost of it taking longer, by running the "index" command single-threaded:

xenome index -T 1

fpbarthel commented 6 years ago

So, that did not seem to fix the problem.

I tried running xenome index -T 1 and xenome index -T 1 -M 24 as per your suggestion and it does not seem to do anything.

The most interesting part is that the file size of the output index files does not change after the program starts outputting lines of text, ie.

XX  XX       XX       0.0
...
...
138698  158334981       194003348       81.6146  
139591  159383557       194003348       82.1551  
140479  160432135       194003348       82.6955  
141363  161480711       194003348       83.236   
142378  162529283       194003348       83.7765  
143564  163577862       194003348       84.317   
144345  164626437       194003348       84.8575  
145444  165675015       194003348       85.398   
146574  166723590       194003348       85.9385  
147543  167772165       194003348       86.479   
148680  168820741       194003348       87.0195  
149831  169869318       194003348       87.56
150791  170917894       194003348       88.1005  
151746  171966469       194003348       88.641   
152629  173015046       194003348       89.1815  
153548  174063622       194003348       89.722   
154478  175112205       194003348       90.2625  
155617  176160772       194003348       90.803   
156774  177209352       194003348       91.3435  
157654  178257925       194003348       91.8839  
158474  179306501       194003348       92.4244  
159568  180355077       194003348       92.9649  
160592  181403651       194003348       93.5054  
161414  182452229       194003348       94.0459  
162274  183500805       194003348       94.5864  
163365  184549382       194003348       95.1269  
164541  185597957       194003348       95.6674  
165461  186646534       194003348       96.2079  
166521  187695115       194003348       96.7484  
167539  188743688       194003348       97.2889  
168645  189792261       194003348       97.8294  
169635  190840838       194003348       98.3699  
170640  191889419       194003348       98.9104  
171758  192937996       194003348       99.4509  
172809  193986571       194003348       99.9914  
172823  194003348       194003348       100

So the file size of the output files did not change between 0 and 100 being printed to stdout, even tho at least 40 minutes to an hour passed.

EDIT: I am now trying to run xenome index using different reference genomes, previously I had been using Ensembl "dna_rm" genomes because it seems appropriate to drop repeat regions for my use but I am now trying with "dna_sm" which does not drop these and instead marks them using small case. See: ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/dna/README

fpbarthel commented 6 years ago

UPDATE: Using the "dna_sm" references from Ensembl the size of the index files is different, but the output index files also do not change during/after xenome index starts outputting text to stdout.

Subsequently running xenome classify also results in truncated fastq files as in the initial post, without returning any useful error messages. I'm at a loss here what is happening. Could this be related to the reference genome? Is xenome index compatible with Ensemble reference genomes?

Deguerre commented 6 years ago

The problem is a concurrency issue. And it's a nasty one. This is why it works on short reference genomes: there isn't enough work for multithreading to be an issue.

fpbarthel commented 6 years ago

I have a co-worker who has said he has run xenome without issue using human/mouse reference genomes (not sure exactly which ones) in the past (two years or so ago).

I've now tried running xenome index with -T 1 and -T 8, could it potentially work with other values of -T?

Are you certain that this is a problem that is due to xenome index and not xenome classify? It seems so to me because there does not seem to be any change in the output files from xenome index after the first 5 minutes of it running, even though it goes on for another hour or so.

Do you think it could work if I tried running this on a different server? Our cluster is running Linux XXX 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov 10 18:01:38 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux on centos-release-6-7.el6.centos.12.3.x86_64

Is there anything else I could do that may resolve this? Is this a bug with the program or could recompiling it fix the issue? If the former, is there any outlook for a fix?

Apologies for the many questions, thanks for your help!

Deguerre commented 6 years ago

"I've now tried running xenome index with -T 1 and -T 8, could it potentially work with other values of -T?"

If the bug is what we think it is, higher values should make it go faster and also increase the probability of xenome hanging.

"Are you certain that this is a problem that is due to xenome index and not xenome classify?"

Now that you mention it, no, I'm not certain. The only reason I said this is that nobody had reported an issue with classify yet, and we hadn't observed an issue with classify yet.

"Do you think it could work if I tried running this on a different server?"

It probably won't make a difference.

"Is this a bug with the program or could recompiling it fix the issue? If the former, is there any outlook for a fix?"

It's a bug. We are still working on it when we can. And, of course, we would eagerly accept patches!

fpbarthel commented 6 years ago

Weird. A co-worker tipped me to uncompress the fastq (gunzip file.fq) and try running xenome classify again... To my surprise xenome now seems to work without issues. Hope this helps pinpoint the problem.

fpbarthel commented 6 years ago

Same issue with xenome index, decompressing FASTA files prior to running seems to allow the program to work as intended. It seems to me that xenome has issues dealing with gzipped fasta/fastq files. Perhaps this is an easier fix than finding some nasty concurrency issues of unknown origin?

data61 / gossamer

most input reads mysteriously discarded from xenome output #14