ctmrbio / BACTpipe

BACTpipe: An assembly and annotation pipeline for bacterial genomics
https://bactpipe.readthedocs.org
MIT License

Integrate renaming script pre sendsketch #129

Closed thorellk closed 3 years ago

thorellk commented 3 years ago

I would like to integrate a script renaming the headers of the assembly fasta files from shovill to short, standardized headers including the strain identifier so that they would look like

>strainX_contig1
>strainX_contig2

This would hopefully make them less error-prone and also more suitable as prokka input. That way we would not have to write out the shovill assembly output file; we could write only the cleaned assembly file after the screen-for-contaminants process, which would also agree better with the new use-screened-contigs branch. The only problem is that the script I have used until now is not written in python3, which I think it should be to minimize extra dependencies. Would any of you, @abhi18av, @emilio-r or @boulund, be able to translate it to python3 for me? I attached it as a txt file since .py files could not be uploaded here.

rename_fasta.txt
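
For readers without the attachment, here is a minimal sketch of the renaming described above. It is not the attached script: the function name `rename_fasta_headers` and the example shovill-style headers are assumptions made for illustration, and the only behavior taken from this thread is that each header becomes the prefix followed by a running contig number.

```python
import io


def rename_fasta_headers(lines, prefix):
    """Yield FASTA lines, replacing each header with '>' + prefix + running number.

    Hypothetical helper for illustration; not the attached rename_fasta script.
    """
    count = 0
    for line in lines:
        if line.startswith(">"):
            # A header line starts a new contig: number it and drop the old text.
            count += 1
            yield ">{}{}\n".format(prefix, count)
        else:
            # Sequence lines pass through unchanged.
            yield line


# Made-up input that only resembles assembler output.
example = io.StringIO(
    ">contig00001 len=5 cov=10.0\n"
    "ACGTA\n"
    ">contig00002 len=3 cov=8.2\n"
    "GGC\n"
)
renamed = "".join(rename_fasta_headers(example, "strainX_contig"))
print(renamed, end="")
```

With the prefix `strainX_contig`, the two headers come out as `>strainX_contig1` and `>strainX_contig2`, matching the pattern requested above.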

abhi18av commented 3 years ago

This way we would not have to write out the shovill assembly output file and could only write the cleaned assembly file after the screen for contaminants process ...

@thorellk, I do like this idea overall. We can integrate this once we're sure of the current code.

Regarding the script, here's the python3 version.

rename_fasta.py.txt

The only changes I made to convert this script were to update the print statements. Luckily the script is short :)

    print('Parsing {}'.format(args.input))
...

    print('Wrote {0} contigs to {1}.'.format(count, args.output))

Please let me know if it works as expected.

thorellk commented 3 years ago

Works like a charm 👍

abhi18av commented 3 years ago

P.S. You could try out the 2to3 tool which comes bundled with Python3 to make sure that there are not huge changes necessary for the upgrade.

abhi18av commented 3 years ago

@thorellk, just to confirm: you would like this rename_fasta.py to be included in the SCREEN_FOR_CONTAMINATION process, right?

abhi18av commented 3 years ago

Am I correct in invoking the script like this?

python3 rename_fasta.py --input A-salmonicida.contigs.fa --output A-salmonicida.contigs.renamed.fa 

I understand that we can pass the --pre value, for example --pre strainX_contig, which would create the same pattern you mentioned earlier. However, I'd like to know the exact value for --pre that you would use.

Perhaps in the end it would look like strain_mucosa_contig1 etc. A few examples would help :)

thorellk commented 3 years ago

Hi @abhi18av! Sorry for my delay in responding. Yes, I thought it would be good to have it as step one in the SCREEN_FOR_CONTAMINATION process, since it may hopefully decrease the risk of sendsketch failing. The --pre pattern should be the same as pair_id_'contig'; would that be possible? For the example above it would be A-salmonicida_contig.
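
The prefix rule above could be sketched roughly as follows. The helper names `contig_prefix` and `build_rename_command` are hypothetical and are not part of BACTpipe; only the `--input`, `--output` and `--pre` flags and the example file names come from this thread.

```python
def contig_prefix(pair_id):
    """Build the --pre value from the sample's pair_id (hypothetical helper)."""
    return "{}_contig".format(pair_id)


def build_rename_command(pair_id, contigs_in, contigs_out):
    """Assemble the rename_fasta.py call as an argument list (hypothetical helper)."""
    return [
        "python3", "rename_fasta.py",
        "--input", contigs_in,
        "--output", contigs_out,
        "--pre", contig_prefix(pair_id),
    ]


cmd = build_rename_command(
    "A-salmonicida",
    "A-salmonicida.contigs.fa",
    "A-salmonicida.contigs.renamed.fa",
)
print(" ".join(cmd))
```

For the example above, the last two arguments are `--pre A-salmonicida_contig`, so the renamed headers would start at `>A-salmonicida_contig1`.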

thorellk commented 3 years ago

Hi @abhi18av! We are now back from the holidays :) Should we have a Skype/Zoom call some day later this week to catch up a bit?

abhi18av commented 3 years ago

Hi @thorellk!

Sure, I'm available as well. Please let me know a good time to connect.

thorellk commented 3 years ago

Hi! Either today (I am available for another 4 hours), Thursday afternoon or Friday any time between 9-17 CET would work for me. How about you?

abhi18av commented 3 years ago

@thorellk, I have a few backlog tasks lined up for today, so if possible I would prefer Thursday. Any time between 14 and 17 CET works :)

Please feel free to mark the calendar and share the meeting link at abhi18avatoutlookdotcom

thorellk commented 3 years ago

Great, I sent an invite for a Zoom call Thursday 14.00 CET :)

abhi18av commented 3 years ago

Thanks! Looking forward to finalizing the pipeline soon 😊

abhi18av commented 3 years ago

With this commit 7a6d00fa72f6628ee3b52b6a74db89801ab94d1b, the renaming script has been integrated.