Public-Health-Bioinformatics / cpo-pipeline

An analysis pipeline for the purpose of investigating Carbapenemase-Producing Organisms.
MIT License

Provide header for plasmid output #49

Closed dfornika closed 5 years ago

dfornika commented 5 years ago

The final_plasmid.tsv output file doesn't include a header. This makes it difficult to understand what each field represents.

dfornika commented 5 years ago

We do provide a header in the final_outputs.tsv file, so we could do something similar for final_plasmids.tsv

The final_outputs.tsv header is set up here:

https://github.com/Public-Health-Bioinformatics/cpo-pipeline/blob/8ed390d68ded30703b5bf03aec6a89b713ca9992/cpo_pipeline/pipeline.py#L213

The final_plasmids.tsv file is prepared here:

https://github.com/Public-Health-Bioinformatics/cpo-pipeline/blob/8ed390d68ded30703b5bf03aec6a89b713ca9992/cpo_pipeline/plasmids/pipeline.py#L389

It looks like there is some header info prepared but it doesn't seem to be written to the file. I'm not sure why that is.

DiDigsDNA commented 5 years ago

I'll take a look at this today.

DiDigsDNA commented 5 years ago

A list of headers to be output is defined for final_plasmids.tsv here: https://github.com/Public-Health-Bioinformatics/cpo-pipeline/blob/8ed390d68ded30703b5bf03aec6a89b713ca9992/cpo_pipeline/plasmids/pipeline.py#L378

but this output file actually contains 9 columns:

(cpo_pipeline) [deisler@sabin BC19A237A]$ head final_plasmid.tsv 
BC19A237A   pBC13Kox003_2   circular    66722   66722   1.0 0   blaKPC-3    IncN

The sample ID is written first, but the code for writing rows is more complicated...

        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t', extrasaction='ignore')
        for candidate in custom_candidates:
            f.write(args.sample_id + '\t')
            # Truncate floats to 4 digits
            writer.writerow({k:round(v,4) if isinstance(v,float) else v for k,v in candidate.items()})

than how rows are written to the final_outputs.tsv file:

    with open(final_output_path, 'w+') as f:
        writer = csv.DictWriter(f, fieldnames=final_outputs_headers, delimiter='\t')
        writer.writeheader()
        writer.writerow(final_outputs)

I think all we need to do is add a sample_id field to fieldnames

        fieldnames = [
            'sample_id',
            'accession',
            'circularity',
            'plasmid_length',
            'bases_above_minimum_depth',
            'percent_above_minimum_depth',
            'snps',
            'allele',
            'incompatibility_group'
        ]

and a writer.writeheader() statement:

        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t', extrasaction='ignore')
        writer.writeheader()
        if custom_best_candidate:
            f.write(args.sample_id + '\t')
            # Truncate floats to 4 digits
            writer.writerow({k:round(v,4) if isinstance(v,float) else v for k,v in custom_best_candidate.items()})
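One thing to watch for when combining the two changes: if 'sample_id' is in fieldnames but the candidate dict has no such key, csv.DictWriter fills that column with an empty string (its default restval). Together with the existing f.write() prefix, that would give each data row one more field than the header. Below is a minimal sketch of the difference, assuming the candidate dicts use the same keys as fieldnames. The helper names and the sample candidate are hypothetical, modelled on the row shown earlier, and this is not necessarily how the pipeline ended up handling it.

```python
import csv
import io

# Column order proposed in the thread, with sample_id added first.
FIELDNAMES = [
    'sample_id',
    'accession',
    'circularity',
    'plasmid_length',
    'bases_above_minimum_depth',
    'percent_above_minimum_depth',
    'snps',
    'allele',
    'incompatibility_group',
]


def truncate_floats(record):
    """Round float values to 4 decimal places; leave other values unchanged."""
    return {k: round(v, 4) if isinstance(v, float) else v for k, v in record.items()}


def write_with_prefix(f, sample_id, candidate):
    """Keeps the f.write() prefix; DictWriter also emits '' for the missing sample_id."""
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES, delimiter='\t', extrasaction='ignore')
    writer.writeheader()
    f.write(sample_id + '\t')
    writer.writerow(truncate_floats(candidate))


def write_with_field(f, sample_id, candidate):
    """Passes sample_id through the row dict, so header and row stay aligned."""
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES, delimiter='\t', extrasaction='ignore')
    writer.writeheader()
    row = truncate_floats(candidate)
    row['sample_id'] = sample_id
    writer.writerow(row)


# Hypothetical candidate (assumed key names), based on the example row above.
candidate = {
    'accession': 'pBC13Kox003_2',
    'circularity': 'circular',
    'plasmid_length': 66722,
    'bases_above_minimum_depth': 66722,
    'percent_above_minimum_depth': 1.0,
    'snps': 0,
    'allele': 'blaKPC-3',
    'incompatibility_group': 'IncN',
}

for write in (write_with_prefix, write_with_field):
    buf = io.StringIO()
    write(buf, 'BC19A237A', candidate)
    header, row = buf.getvalue().splitlines()
    print(write.__name__, len(header.split('\t')), len(row.split('\t')))
```

With the prefix variant the header has 9 fields and the row has 10; with the field variant both have 9. Dropping the f.write() call and setting row['sample_id'] looks like the simpler way to keep the two in step.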

Do you agree?

dfornika commented 5 years ago

Yes, that looks correct to me.

DiDigsDNA commented 5 years ago

Oh, tricky-dicky! I'm almost seeing double:

    with open(plasmid_output_final, 'w+') as f:
        fieldnames = [
            'accession',
            'circularity',
            'plasmid_length',
            'bases_above_minimum_depth',
            'percent_above_minimum_depth',
            'snps',
            'allele',
            'incompatibility_group'
        ]
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t', extrasaction='ignore')
        if custom_best_candidate:
            f.write(args.sample_id + '\t')
            # Truncate floats to 4 digits
            writer.writerow({k:round(v,4) if isinstance(v,float) else v for k,v in custom_best_candidate.items()})

    with open(plasmid_output_summary, 'w+') as f:
        fieldnames = [
            'accession',
            'circularity',
            'plasmid_length',
            'bases_above_minimum_depth',
            'percent_above_minimum_depth',
            'snps',
            'allele',
            'incompatibility_group'
        ]
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t', extrasaction='ignore')
        for candidate in custom_candidates:
            f.write(args.sample_id + '\t')
            # Truncate floats to 4 digits
            writer.writerow({k:round(v,4) if isinstance(v,float) else v for k,v in candidate.items()})

May have to revisit plasmid output summary to see if that needs tweaking as well.
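Since the two blocks differ only in the output path and in whether they write one candidate or many, one option is to factor them into a single helper so the header and column list live in one place. This is only a sketch under assumptions: write_plasmid_tsv and PLASMID_FIELDNAMES are hypothetical names, not existing pipeline code, and the candidate dicts are assumed to use the same keys as the column list.

```python
import csv
import os
import tempfile

# Hypothetical shared column list: sample_id followed by the existing fields.
PLASMID_FIELDNAMES = [
    'sample_id',
    'accession',
    'circularity',
    'plasmid_length',
    'bases_above_minimum_depth',
    'percent_above_minimum_depth',
    'snps',
    'allele',
    'incompatibility_group',
]


def write_plasmid_tsv(path, sample_id, candidates):
    """Write a header line, then one tab-separated row per candidate."""
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=PLASMID_FIELDNAMES,
                                delimiter='\t', extrasaction='ignore')
        writer.writeheader()
        for candidate in candidates:
            # Truncate floats to 4 digits, as in the existing code
            row = {k: round(v, 4) if isinstance(v, float) else v
                   for k, v in candidate.items()}
            row['sample_id'] = sample_id
            writer.writerow(row)


# Example with a hypothetical candidate; a real run would pass
# custom_best_candidate / custom_candidates instead.
best = {
    'accession': 'pBC13Kox003_2',
    'circularity': 'circular',
    'plasmid_length': 66722,
    'bases_above_minimum_depth': 66722,
    'percent_above_minimum_depth': 1.0,
    'snps': 0,
    'allele': 'blaKPC-3',
    'incompatibility_group': 'IncN',
}

out_dir = tempfile.mkdtemp()
final_path = os.path.join(out_dir, 'final_plasmid.tsv')
summary_path = os.path.join(out_dir, 'custom_plasmid.txt')

# The "final" file gets at most one row; the summary gets every candidate.
write_plasmid_tsv(final_path, 'BC19A237A', [best] if best else [])
write_plasmid_tsv(summary_path, 'BC19A237A', [best])
```

Note that this writes the header even when there is no best candidate, which matches calling writer.writeheader() before the `if custom_best_candidate:` check.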

DiDigsDNA commented 5 years ago

The custom_plasmid.txt output file also needs headers printed as well as an additional header for sample_id.

I don't know if you want to create a separate issue for this, but I was going to address it in the pipeline.py script before I run the pipeline with input data (because running the pipeline takes so long).

dfornika commented 5 years ago

Sure, we can add headers to both files in this issue.

dfornika commented 5 years ago

Fixed by #50