Duplicate reads to Unique read names...

fmalmeida / bacannot

Generic but comprehensive pipeline for prokaryotic genome annotation and interrogation with interactive reports and shiny app.

https://bacannot.readthedocs.io/en/latest/

GNU General Public License v3.0


Duplicate reads to Unique read names... #107

Closed Michaelijesse closed 9 months ago

Michaelijesse commented 9 months ago

Hello @fmalmeida, I previously suggested adding seqkit to rename duplicate reads, but I ran into problems with the seqkit-processed subreads. So I switched to the following script for renaming PacBio subreads. Now it's working fine.

gunzip -c file.fastq.gz | awk '{if(NR%4==1) $0=sprintf("@1_%d",(1+i++)); print;}' | gzip -c > another.fastq.gz
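For readers who prefer Python, here is a minimal sketch of what the one-liner above does, plus a small check for leftover duplicate names. It assumes a well-formed gzipped FASTQ with exactly four lines per record, as the awk command does. The function names `rename_reads` and `duplicate_names` are hypothetical and are not part of bacannot or seqkit.

```python
import gzip
from collections import Counter


def rename_reads(src_path, dst_path, prefix="1_"):
    """Rewrite every FASTQ header as '@<prefix><n>', with n counting from 1.

    Mirrors the awk command: only the first line of each 4-line record
    is replaced; sequence, '+' and quality lines are copied unchanged.
    """
    count = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line_no, line in enumerate(src):
            if line_no % 4 == 0:  # header line of each record
                count += 1
                line = f"@{prefix}{count}\n"
            dst.write(line)
    return count  # number of reads renamed


def duplicate_names(path):
    """Return the sorted read names that occur more than once."""
    with gzip.open(path, "rt") as handle:
        names = Counter(
            # the read name is the header text up to the first space
            line.rstrip("\n").split(" ")[0]
            for line_no, line in enumerate(handle)
            if line_no % 4 == 0
        )
    return sorted(name for name, n in names.items() if n > 1)
```

With the file names from the command above, `rename_reads("file.fastq.gz", "another.fastq.gz")` would write the renamed reads, and `duplicate_names("another.fastq.gz")` should then return an empty list.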

fmalmeida commented 9 months ago

Perfect. Many thanks for sharing. I will add this to the next release, v3.3, which is currently being tested.


Michaelijesse commented 9 months ago

Do you have any published long-read dataset for testing long-read assembly and annotation? Most of the genomes available in NCBI were polished with Illumina short reads. I am looking for a published SRA dataset together with its published assembly, produced without Illumina read polishing.

fmalmeida commented 9 months ago

Maybe this could help you: https://www.nature.com/articles/s41592-022-01539-7

I have never used them, though. I generally test by comparing against the reference.

fmalmeida commented 9 months ago

Hi @Michaelijesse, I have added this functionality to the code, and it will be released soon.

To activate the deduplication step, add the --enable_deduplication parameter to the command line.

Could you give it a try using the dev branch and check whether it works as desired?

In the meantime, I will start wrapping up the rest to make a release.

I will close the ticket for now. If it does not work as desired, or a change is needed, please re-open it.