MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes

Readgroup names #45

Closed rmonti closed 7 years ago

rmonti commented 7 years ago

Hi Mike,

I was wondering how ShortStack determines the names of the read groups (RG) for the output alignment files.

I did not specify any prefixes, and submitted a file with this path and name:

../fastq/BHXPU.11045.1.190494.TTAGGC.filter-SMRNA.fastq.gz

It somehow guessed that it should call the BAM file BHXPU.bam and then merge it into merged_alignments.bam with RG=BHXPU, which is essentially what I wanted.

Does it just take the name of the file up to the first period as the read group?

I could imagine cases where this would yield an error, e.g. if I submit a bunch of files that looked like this:

./reads.1.fastq ./reads.2.fastq

and so on...

So how are the names determined?

best,

Remo

MikeAxtell commented 7 years ago

Hi Remo.

When parsing the file paths of read files, ShortStack treats everything from the first '.' to the end of the path as a suffix (e.g. the file extension). The remaining 'basename' of each input file is kept as the read-group name (and used for other purposes).

Yes, the hypothetical situation you describe would cause issues. I never thought of it because I am personally strict about using '.' in file names only for extensions / suffixes that describe the file format, not for distinguishing names. But that is a matter of style, and style obviously varies between users.

I'll mark this as a bug and include a fix in the next release.

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

MikeAxtell commented 7 years ago

To clarify, when parsing read file names, everything after the last forward slash / but before the first period . is considered the file's 'base name'.
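For illustration only, the rule described above can be sketched in a few lines of Python. This is not ShortStack's actual code; the function names `base_name` and `find_collisions` are hypothetical and exist only to show why the example from the original question produces clashing read-group names under the pre-fix behavior.

```python
from collections import defaultdict


def base_name(path):
    """Return the text after the last '/' and before the first '.'.

    Hypothetical helper mirroring the rule described in this thread;
    it is not taken from ShortStack itself.
    """
    filename = path.rsplit("/", 1)[-1]   # drop any directory components
    return filename.split(".", 1)[0]     # keep text before the first period


def find_collisions(paths):
    """Group input paths by derived base name and report names shared by several files."""
    groups = defaultdict(list)
    for p in paths:
        groups[base_name(p)].append(p)
    return {name: files for name, files in groups.items() if len(files) > 1}


# The file from the original question yields the read group seen in the output.
print(base_name("../fastq/BHXPU.11045.1.190494.TTAGGC.filter-SMRNA.fastq.gz"))

# The hypothetical problem case: both files collapse to the same base name.
print(find_collisions(["./reads.1.fastq", "./reads.2.fastq"]))
```

Under this rule both `./reads.1.fastq` and `./reads.2.fastq` map to `reads`, so a check like `find_collisions` before alignment would flag input sets whose names differ only after the first period.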

MikeAxtell commented 7 years ago

Fixed in release 3.7