BigelowLab / gorg-classifier

Produce taxonomic and functional annotations of shotgun metagenomes, metatranscriptomes and metaproteome sequences.
MIT License
4 stars 0 forks source link

Summary function not working with default GORG database #9

Open juliambrosman opened 2 years ago

juliambrosman commented 2 years ago

text in the summary file states: "custom databases without the expected columns are not supported in the summary function"

I ran:

nextflow run BigelowLab/gorg-classifier -profile docker --seqs 'data/metagenomes/SRR*1.fastq.gz' --cpus 8

the metagenome I used was downloaded like so:

fastq-dump --split-files --skip-technical -N 0 -X 300000 --gzip --readids SRR6507279
brwnj commented 2 years ago

I fixed the bug related to this issue in 23b3f7092d49b066f1a1d7ecf679bb677f3a6de2. I did have to update these read names though because they should be the same in R1 and R2.

Here's the name fix I did:

def readfq(fp): # this is a generator function
    last = None # this is a buffer keeping the last unprocessed line
    while True: # mimic closure; is it a bad idea?
        if not last: # the first record or a record following a fastq
            for l in fp: # search for the start of the next record
                if l[0] in '>@': # fasta/q header line
                    last = l[:-1] # save this line
                    break
        if not last: break
        name, seqs, last = last[1:].partition(" ")[0], [], None
        for l in fp: # read the sequence
            if l[0] in '@+>':
                last = l[:-1]
                break
            seqs.append(l[:-1])
        if not last or last[0] != '+': # this is a fasta record
            yield name, ''.join(seqs), None # yield a fasta record
            if not last: break
        else: # this is a fastq record
            seq, leng, seqs = ''.join(seqs), 0, []
            for l in fp: # read the quality
                seqs.append(l[:-1])
                leng += len(l) - 1
                if leng >= len(seq): # have read enough quality
                    last = None
                    yield name, seq, ''.join(seqs); # yield a fastq record
                    break
            if last: # reach EOF before reading enough quality
                yield name, seq, None # yield a fasta record instead
                break

if __name__ == "__main__":
    import sys
    for name, seq, qual in readfq(sys.stdin):
        name = name.rpartition(".")[0]
        print("@" + name, seq, "", qual, sep="\n")

Then

zcat SRR6507279_1.fastq.gz | python update_names.py | gzip > example/SRR6507279_1.fastq.gz
# etc.