dfornika closed 5 years ago
We do provide a header in the final_outputs.tsv file, so we could do something similar for final_plasmids.tsv.
The final_outputs.tsv header is set up here:
The final_plasmids.tsv file is prepared here:
It looks like there is some header info prepared but it doesn't seem to be written to the file. I'm not sure why that is.
I'll take a look at this today.
A list of headers to be output is defined for final_plasmids.tsv here: https://github.com/Public-Health-Bioinformatics/cpo-pipeline/blob/8ed390d68ded30703b5bf03aec6a89b713ca9992/cpo_pipeline/plasmids/pipeline.py#L378
but this output file actually contains 9 columns:
(cpo_pipeline) [deisler@sabin BC19A237A]$ head final_plasmid.tsv
BC19A237A pBC13Kox003_2 circular 66722 66722 1.0 0 blaKPC-3 IncN
The sample ID is written first, but the code for writing rows is more complicated...
writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t', extrasaction='ignore')
for candidate in custom_candidates:
    f.write(args.sample_id + '\t')
    # Truncate floats to 4 digits
    writer.writerow({k:round(v,4) if isinstance(v,float) else v for k,v in candidate.items()})
than how rows are written to the final_outputs.tsv file:
with open(final_output_path, 'w+') as f:
    writer = csv.DictWriter(f, fieldnames=final_outputs_headers, delimiter='\t')
    writer.writeheader()
    writer.writerow(final_outputs)
I think all we need to do is add a sample_id field to fieldnames:
fieldnames = [
    'sample_id',
    'accession',
    'circularity',
    'plasmid_length',
    'bases_above_minimum_depth',
    'percent_above_minimum_depth',
    'snps',
    'allele',
    'incompatibility_group'
]
and a writer.writeheader() statement:
writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t', extrasaction='ignore')
writer.writeheader()
if custom_best_candidate:
    f.write(args.sample_id + '\t')
    # Truncate floats to 4 digits
    writer.writerow({k:round(v,4) if isinstance(v,float) else v for k,v in custom_best_candidate.items()})
Do you agree?
Yes, that looks correct to me.
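One detail that may be worth checking in the proposal above: if sample_id is added to fieldnames while the row is still prefixed with f.write(args.sample_id + '\t'), csv.DictWriter fills the missing sample_id key with its default restval (an empty string), so each row would end up one column wider than the header. A minimal sketch of the difference, assuming the candidate dicts have no sample_id key of their own; the field list is shortened for illustration and the render helper is hypothetical:

```python
import csv
import io

# Shortened field list for illustration; the list in the thread has more columns.
fieldnames = ['sample_id', 'accession', 'circularity']
# Values borrowed from the example row in the thread.
candidate = {'accession': 'pBC13Kox003_2', 'circularity': 'circular'}
sample_id = 'BC19A237A'


def render(prefix_with_write):
    """Return [header_line, row_line] for one of the two writing strategies."""
    f = io.StringIO()
    writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t',
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()
    if prefix_with_write:
        # As proposed: write the sample ID directly, then the DictWriter row.
        f.write(sample_id + '\t')
        writer.writerow(candidate)
    else:
        # Alternative: put the sample ID in the row dict and let DictWriter place it.
        writer.writerow({'sample_id': sample_id, **candidate})
    return f.getvalue().splitlines()


header, row = render(prefix_with_write=True)
print(len(header.split('\t')), len(row.split('\t')))  # 3 4: row has an extra empty cell

header, row = render(prefix_with_write=False)
print(len(header.split('\t')), len(row.split('\t')))  # 3 3: header and row line up
```

If that assumption holds for the real candidates, dropping the f.write call and passing the sample ID through the row dict would keep the header and the data aligned.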
Oh, tricky-dicky! I'm almost seeing double:
with open(plasmid_output_final, 'w+') as f:
    fieldnames = [
        'accession',
        'circularity',
        'plasmid_length',
        'bases_above_minimum_depth',
        'percent_above_minimum_depth',
        'snps',
        'allele',
        'incompatibility_group'
    ]
    writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t', extrasaction='ignore')
    if custom_best_candidate:
        f.write(args.sample_id + '\t')
        # Truncate floats to 4 digits
        writer.writerow({k:round(v,4) if isinstance(v,float) else v for k,v in custom_best_candidate.items()})
with open(plasmid_output_summary, 'w+') as f:
    fieldnames = [
        'accession',
        'circularity',
        'plasmid_length',
        'bases_above_minimum_depth',
        'percent_above_minimum_depth',
        'snps',
        'allele',
        'incompatibility_group'
    ]
    writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t', extrasaction='ignore')
    for candidate in custom_candidates:
        f.write(args.sample_id + '\t')
        # Truncate floats to 4 digits
        writer.writerow({k:round(v,4) if isinstance(v,float) else v for k,v in candidate.items()})
May have to revisit plasmid output summary to see if that needs tweaking as well.
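Since the two blocks above differ only in the output path and in which candidates they iterate over, one option would be a shared helper so both files get the same header. This is a rough sketch rather than the pipeline's actual code: the helper name write_plasmid_table, the output file names, and the mapping of the example values to columns are all assumptions, and it presumes the candidate dicts use the keys in the field list.

```python
import csv
import os
import tempfile

# Field list from the thread, with the proposed sample_id column in front.
PLASMID_FIELDNAMES = [
    'sample_id',
    'accession',
    'circularity',
    'plasmid_length',
    'bases_above_minimum_depth',
    'percent_above_minimum_depth',
    'snps',
    'allele',
    'incompatibility_group',
]


def write_plasmid_table(path, sample_id, candidates):
    """Write a header line plus one tab-separated row per candidate (hypothetical helper)."""
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=PLASMID_FIELDNAMES, delimiter='\t',
                                extrasaction='ignore', lineterminator='\n')
        writer.writeheader()
        for candidate in candidates:
            # Round floats to 4 decimal places, as the original code does.
            row = {k: round(v, 4) if isinstance(v, float) else v
                   for k, v in candidate.items()}
            # Pass the sample ID through the row dict so it lines up with the header.
            row['sample_id'] = sample_id
            writer.writerow(row)


# Example candidate; values taken from the row shown in the thread,
# assigned to columns in order (an assumption).
candidates = [{
    'accession': 'pBC13Kox003_2',
    'circularity': 'circular',
    'plasmid_length': 66722,
    'bases_above_minimum_depth': 66722,
    'percent_above_minimum_depth': 1.0,
    'snps': 0,
    'allele': 'blaKPC-3',
    'incompatibility_group': 'IncN',
}]

out_dir = tempfile.mkdtemp()
final_path = os.path.join(out_dir, 'final.tsv')      # placeholder file name
summary_path = os.path.join(out_dir, 'summary.tsv')  # placeholder file name

write_plasmid_table(final_path, 'BC19A237A', candidates[:1])  # best candidate only
write_plasmid_table(summary_path, 'BC19A237A', candidates)    # all candidates
```

With a helper like this, any later change to the column list would only need to be made in one place.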
The custom_plasmid.txt output file also needs headers printed, as well as an additional header for sample_id.
I don't know if you want to create a separate issue for this, but I was going to address it in the pipeline.py script before I run the pipeline with input data (because running the pipeline takes so long).
Sure, we can add headers to both files in this issue.
Fixed by #50
The final_plasmid.tsv output file doesn't include a header. This makes it difficult to understand what each field represents.