biomedicalinformaticsgroup / Sargasso

Sargasso disambiguates mixed-species high-throughput sequencing data.
http://biomedicalinformaticsgroup.github.io/Sargasso/

sed extraction of read ids from sambamba stream slower than necessary #23

Closed s-heron closed 9 years ago

s-heron commented 9 years ago

The extraction is currently done with the following command in the /bin/filter_reads script:

START_IDS["${j}"]=$( sambamba view -t "${THREADS}" "${sorted_reads_prefix}"."${LARGER_SPECIES}".bam | sed -n "${ID_INDICES[${j}]}p; $(( ${ID_INDICES[${j}]} + 1 ))q" | awk '{print $1}' )

When run individually in the terminal, execution ceases (as specified) upon reaching the line after the target line. When run in the script, this seems to occur for the first instance, but subsequent instances wait for the sambamba file stream to finish, which takes quite a while and seemingly ignores sed's 'q' (quit) command. Finding out what causes this difference in behaviour could shave roughly 3 minutes off the execution time (for 4 threads/blocks).
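
One possible simplification, offered as a sketch rather than a confirmed fix: the `sed | awk` pair can be collapsed into a single awk process that prints the first field of the target line and exits on that same line, instead of one line later. The helper name `extract_read_id` is hypothetical and not part of the script, and the `printf` input is only a stand-in for the sambamba stream so the example runs without a BAM file. Whether this removes the wait seen inside the script has not been tested here.

```shell
# Hypothetical helper (not in filter_reads): print the first field of
# line number $1 read from stdin, then stop reading immediately.
extract_read_id() {
    awk -v target="$1" 'NR == target { print $1; exit }'
}

# Stand-in for the sambamba stream: three tab-separated records.
printf 'read_a\t0\nread_b\t16\nread_c\t0\n' | extract_read_id 2
# prints "read_b"
```

In the script, the stage after `sambamba view -t "${THREADS}" ...` would then become `| extract_read_id "${ID_INDICES[${j}]}"`. Note that the upstream process only stops once it next tries to write to the closed pipe, so some delay may remain regardless of which tool does the extraction.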