arshajii / ema

Fast & accurate alignment of barcoded short-reads
http://ema.csail.mit.edu
MIT License

Error during preproc #5

Closed vladsavelyev closed 6 years ago

vladsavelyev commented 6 years ago

Hi,

I downloaded the 10x WGS data from https://support.10xgenomics.com/genome-exome/datasets/2.1.4/NA12878_WGS_v2 and subset it to 10k reads with seqtk sample:

seqtk sample -s100 NA12878_WGS_v2_S1_L001_R1_001.fastq 10000 > subset10k_L001_R1_001.fastq
seqtk sample -s100 NA12878_WGS_v2_S1_L001_R2_001.fastq 10000 > subset10k_L001_R2_001.fastq

I took the barcode whitelist from 10x's tenkit package: https://github.com/10XGenomics/supernova/blob/master/tenkit/lib/python/tenkit/barcodes/4M-with-alts-february-2016.txt

Then I tried to run the pipeline. The count command ran fine (though on the first attempt it crashed on compressed input, before I figured out that .fastq.gz is not supported):

cat subset10k_L001_R*_001.fastq | ema count -1 - -w 4M-with-alts-february-2016.txt -o counts_file

However, the preproc command that follows dies with this error:

cat subset10k_L001_R*_001.fastq | ema preproc -1 - -w 4M-with-alts-february-2016.txt -c counts_file -n 2
> ema: src/preprocess.c:389: preprocess_fastqs: Assertion `__extension__ ({ size_t __s1_len, __s2_len; (__builtin_constant_p (id1) && __builtin_constant_p (id2) && (__s1_len = strlen (id1), __s2_len = strlen (id2), (!((size_t)(const void *)((id1) + 1) - (size_t)(const void *)(id1) == 1) || __s1_len >= 4) && (!((size_t)(const void *)((id2) + 1) - (size_t)(const void *)(id2) == 1) || __s2_len >= 4)) ? __builtin_strcmp (id1, id2) : (__builtin_constant_p (id1) && ((size_t)(const void *)((id1) + 1) - (size_t)(const void *)(id1) == 1) && (__s1_len = strlen (id1), __s1_len < 4) ? (__builtin_constant_p (id2) && ((size_t)(const void *)((id2) + 1) - (size_t)(const void *)(id2) == 1) ? __builtin_strcmp (id1, id2) : (__extension__ ({ const unsigned char *__s2 = (const unsigned char *) (const char *) (id2); register int __result = (((const unsigned char *) (const char *) (id1))[0] - __s2[0]); if (__s1_len > 0 && __result == 0) { __result = (((const unsigned char *) (const char *) (id1))[1] - __s2[1]); if (__s1_len > 1 && __result == 0) { __result = (((const unsigned char *) (const char *) (id1))[2] - __s2[2]); if (__s1_len > 2 && __result == 0) __result = (((const unsigned char *) (const char *) (id1))[3] - __s2[3]); } } __result; }))) : (__builtin_constant_p (id2) && ((size_t)(const void *)((id2) + 1) - (size_t)(const void *)(id2) == 1) && (__s2_len = strlen (id2), __s2_len < 4) ? (__builtin_constant_p (id1) && ((size_t)(const void *)((id1) + 1) - (size_t)(const void *)(id1) == 1) ? 
__builtin_strcmp (id1, id2) : (__extension__ ({ const unsigned char *__s1 = (const unsigned char *) (const char *) (id1); register int __result = __s1[0] - ((const unsigned char *) (const char *) (id2))[0]; if (__s2_len > 0 && __result == 0) { __result = (__s1[1] - ((const unsigned char *) (const char *) (id2))[1]); if (__s2_len > 1 && __result == 0) { __result = (__s1[2] - ((const unsigned char *) (const char *) (id2))[2]); if (__s2_len > 2 && __result == 0) __result = (__s1[3] - ((const unsigned char *) (const char *) (id2))[3]); } } __result; }))) : __builtin_strcmp (id1, id2)))); }) == 0' failed.
[1]    1448 broken pipe  cat subset10k_L001_R*_001.fastq |
       1449 abort        ema preproc -1 - -w 4M-with-alts-february-2016.txt -c counts_file -n 2

Does it have something to do with subsetting the input?

Attaching a tarball with the inputs.

NA12878_WGS_10x_subset10k.gz

vladsavelyev commented 6 years ago

Just as I submitted the issue, I figured out that for preproc I have to pass the FASTQ files separately with the -1 and -2 options :) Works now.
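
For anyone hitting the same assertion: the long expression in the error is glibc's inlined strcmp, so the check at src/preprocess.c:389 boils down to strcmp(id1, id2) == 0, i.e. two read IDs that were expected to match did not. A plausible explanation (an inference from the log, not something confirmed in this thread) is that a single stream passed with -1 - is read as interleaved pairs, whereas cat R1 R2 produces all R1 records followed by all R2 records. The sketch below is a rough way to test a stream for that before running preproc. The function names and the toy reads are made up for illustration and are not part of ema.

```python
import io
from itertools import islice


def read_ids(handle):
    """Yield the read ID of each 4-line FASTQ record in a text stream.

    The ID is the header up to the first whitespace, without the leading
    '@' and without a trailing /1 or /2 mate suffix.
    """
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header = record[0].rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        read_id = (header[1:].split() or [""])[0]
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]
        yield read_id


def first_mismatched_pair(handle):
    """Treat the stream as interleaved pairs and compare mate IDs.

    Returns (pair_index, id1, id2) for the first pair whose IDs differ
    (id2 is None if the last record has no mate), or None if all match.
    """
    ids = read_ids(handle)
    for pair_index, id1 in enumerate(ids):
        id2 = next(ids, None)  # the mate is the next record in the stream
        if id1 != id2:
            return pair_index, id1, id2
    return None


# Toy data (hypothetical reads, two records per mate file).
R1 = "@read1 1:N:0\nACGT\n+\nIIII\n@read2 1:N:0\nTTTT\n+\nIIII\n"
R2 = "@read1 2:N:0\nGGGG\n+\nIIII\n@read2 2:N:0\nCCCC\n+\nIIII\n"
CONCATENATED = R1 + R2  # what `cat R1.fastq R2.fastq` produces
INTERLEAVED = (
    "@read1 1:N:0\nACGT\n+\nIIII\n@read1 2:N:0\nGGGG\n+\nIIII\n"
    "@read2 1:N:0\nTTTT\n+\nIIII\n@read2 2:N:0\nCCCC\n+\nIIII\n"
)

print(first_mismatched_pair(io.StringIO(CONCATENATED)))  # (0, 'read1', 'read2')
print(first_mismatched_pair(io.StringIO(INTERLEAVED)))   # None
```

On the concatenated stream the very first "pair" is read1 followed by read2, which is the kind of ID mismatch the assertion reports. Passing the two files separately, as described above, avoids the problem.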