RAHenriksen / NGSNGS

NGSNGS: Next generation simulator for next generation sequencing data
amplicon mode? #28

Closed by grenaud 2 months ago

grenaud commented 1 year ago

Hi Rasmus, have you had time to look at the amplicon mode i.e. FASTA in and FASTQ out with exactly the same sequences?
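For readers unfamiliar with the request, "FASTA in and FASTQ out with exactly the same sequences" can be sketched in a few lines of Python. This is only an illustration of the idea, not NGSNGS code: the function names `read_fasta` and `fasta_to_fastq` and the constant placeholder quality of 40 are assumptions made for the example.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def fasta_to_fastq(fasta_handle, fastq_handle, phred=40):
    """Write every FASTA record unchanged as a FASTQ record.

    The quality string is a constant placeholder (assumption for this sketch),
    encoded as Phred+33.
    """
    qual_char = chr(phred + 33)
    for name, seq in read_fasta(fasta_handle):
        fastq_handle.write(f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n")

# Usage with in-memory text; real files opened with open() work the same way.
fasta = io.StringIO(">amp1\nACGT\nAC\n>amp2\nTTGA\n")
fastq = io.StringIO()
fasta_to_fastq(fasta, fastq)
print(fastq.getvalue(), end="")
```

Each output read has the same name and sequence as its input record, so any errors would have to be added in a separate step.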

ANGSD commented 1 year ago

Rasmus is traveling, so maybe I should implement it. Since the input is fa and fq, I think the bottleneck will be reading the input file, so I doubt it will be able to take advantage of multithreading efficiently.

grenaud commented 1 year ago

yes if you can that would be great! thanks Thorfinn!

grenaud commented 1 year ago

Have you had time to look at this?

grenaud commented 1 year ago

Hi both, just to check if you have had time to look at this feature?

grenaud commented 1 year ago

Hello! Have you had time to look at this feature?

RAHenriksen commented 1 year ago

Hi Gabriel,

Unfortunately, neither Thorfinn nor I will have time to look into this until January 2024, so there won't be any major updates to this functionality within the next three months.

I'll let you know once it is done!

RAHenriksen commented 2 months ago

Hi Gabriel,

I have now added an amplicon functionality. It is still in its initial phase, so at the moment it can add deamination, stochastic indels, and nucleotide substitutions from the misincorporation files. The input files and parameters are similar to those used when running the ngsngs command.

Let me know if you are going to use it and if you have suggestions / improvements / identify bugs etc.

grenaud commented 2 months ago

Hi Rasmus,

Thank you for adding this! What we would need is a mode that simulates only sequencing errors. Would that be possible?

Gabriel

RAHenriksen commented 2 months ago

Within the next few days I can look further into adding some of the functionalities that currently exist for the reference-based simulations with ngsngs:

-q1 and -q2: a given quality profile.
-qs: a fixed quality score, so each nucleotide has a fixed sequencing error probability.

However these two suggestions would only work for input fasta files.
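To make the fixed-score case concrete: a Phred score Q corresponds to an error probability of 10^(-Q/10), so one score for all bases gives one error probability for all bases. The Python sketch below is illustrative only and is not how NGSNGS is implemented. The function names are hypothetical, and choosing the replacement base uniformly among the three other nucleotides is an assumption of the example.

```python
import random

def phred_to_error_prob(q):
    """Standard Phred definition: P(error) = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def add_substitution_errors(seq, q, rng):
    """Substitute each A/C/G/T base with probability 10^(-q/10).

    Assumption for this sketch: an erroneous base is drawn uniformly
    from the three other nucleotides. Non-ACGT symbols are left as-is.
    """
    p_err = phred_to_error_prob(q)
    out = []
    for base in seq:
        if base in "ACGT" and rng.random() < p_err:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(1)  # seeded so repeated runs give the same reads
original = "ACGT" * 10
noisy = add_substitution_errors(original, q=10, rng=rng)
mismatches = sum(a != b for a, b in zip(original, noisy))
print(noisy, mismatches)
```

With `q=10` roughly one base in ten is expected to change, while `q=40` corresponds to about one in ten thousand.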

Existing fastq or sam/bam files should already contain sequencing errors, but we could potentially add additional random noise to the sequencing reads.

/Rasmus

grenaud commented 2 months ago

I mean you have an input of fasta files with adapters, representing the DNA sequences that are bound to the flow cell. They all have the same length, and we just need sequencing errors according to a certain sequencer profile.
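A profile-driven version of this could look roughly like the Python sketch below: for each cycle (read position) a quality score is sampled from that cycle's distribution, and the base is then substituted with the matching Phred error probability. The `profile` layout (a list of `scores`/`weights` dictionaries) and the function name `simulate_read` are hypothetical and chosen for the example. This is not the NGSNGS quality profile file format.

```python
import random

def simulate_read(seq, profile, rng):
    """Return (read, quality_string) for one fixed-length sequence.

    profile[i] holds the candidate Phred scores and their weights for
    cycle i (hypothetical layout, assumed for this sketch).
    """
    if len(profile) < len(seq):
        raise ValueError("profile has fewer cycles than the sequence length")
    bases, quals = [], []
    for base, cycle in zip(seq, profile):
        # Sample this cycle's quality score from its distribution.
        q = rng.choices(cycle["scores"], weights=cycle["weights"])[0]
        # Substitute the base with probability 10^(-q/10).
        if base in "ACGT" and rng.random() < 10 ** (-q / 10):
            base = rng.choice([b for b in "ACGT" if b != base])
        bases.append(base)
        quals.append(chr(q + 33))  # Phred+33 encoding
    return "".join(bases), "".join(quals)

# Toy profile: quality drops toward the end of the read.
profile = (
    [{"scores": [40, 30], "weights": [0.9, 0.1]}] * 6
    + [{"scores": [30, 10], "weights": [0.5, 0.5]}] * 2
)
rng = random.Random(7)
read, qual = simulate_read("ACGTACGT", profile, rng)
print(f"@amp1\n{read}\n+\n{qual}")
```

Because every input sequence has the same length, a single profile with one entry per cycle covers all reads.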

RAHenriksen commented 2 months ago

Yes, good. I can add this functionality within the next week.

grenaud commented 2 months ago

amazing! let me know :-)

RAHenriksen commented 2 months ago

Hi Gabriel,

I have now added the functionality to simulate nucleotide quality scores, both from a quality profile and from a fixed score. If you're only interested in quality scores, you can also disable the sequencing errors.

You can check the help page. One example is shown below:

./amplicon --amplicon Test_Examples/Amplicon_in.fa --format fq -q Test_Examples/AccFreqL150R1.txt --output Amplicon_seqerr

For every read, a modification vector is generated indicating the alterations made to the DNA sequence. The modification vector is explained further in the README file.

Again this is an initial version so if you identify any potential problems please let me know.