epi2me-labs / wf-clone-validation

Erroneous sequence reported in insert fasta file #55

Closed Nifaste closed 2 weeks ago

Nifaste commented 3 months ago

I have observed an error in the insert sequences written to the FASTA file {{ alias }}.insert.fasta. I investigated, and the issue appears to originate from the following code in find_insert.py:

    rev_comp = reverse_complement(whole_seq)
    strand_seq = {'-': rev_comp, '+': whole_seq}
    parse_seq = strand_seq[str(df['strand'][0])]
    final_seq = parse_seq[df['start'][0]::] + parse_seq[:df['end'][0]:]
    df['sequence'][0] = final_seq

The start and stop coordinates are not being adjusted when taking the reverse complement of the sequence.

[Image: mapping_inserts_to_ref]

I fixed the issue by updating the code:

    insert_seq = whole_seq[df['start'][0]::] + whole_seq[:df['end'][0]:]
    rev_comp = reverse_complement(insert_seq)
    strand_seq = {'-': rev_comp, '+': insert_seq}
    final_seq = strand_seq[str(df['strand'][0])]
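To make the difference concrete, here is a minimal standalone sketch comparing the two approaches on a toy sequence. It assumes the start and end values refer to the forward strand and are used directly as Python slice indices, as in the snippets above. The `reverse_complement` helper and both function names are hypothetical stand-ins, not the workflow's actual code.

```python
def reverse_complement(seq):
    """Hypothetical stand-in for the workflow's helper (upper-case ACGT only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def extract_insert_old(whole_seq, start, end, strand):
    """Original logic: reverse-complement the whole sequence, then slice."""
    parse_seq = reverse_complement(whole_seq) if strand == "-" else whole_seq
    return parse_seq[start:] + parse_seq[:end]


def extract_insert_fixed(whole_seq, start, end, strand):
    """Proposed logic: slice on the forward strand, then reverse-complement."""
    insert_seq = whole_seq[start:] + whole_seq[:end]
    return reverse_complement(insert_seq) if strand == "-" else insert_seq


toy_seq = "ACGTTGCAAGGC"  # made-up sequence, for illustration only
start, end = 8, 3         # made-up coordinates; the insert wraps past the end

plus_old = extract_insert_old(toy_seq, start, end, "+")
plus_fixed = extract_insert_fixed(toy_seq, start, end, "+")
minus_old = extract_insert_old(toy_seq, start, end, "-")
minus_fixed = extract_insert_fixed(toy_seq, start, end, "-")

print(plus_old, plus_fixed)    # AGGCACG AGGCACG
print(minus_old, minus_fixed)  # ACGTGCC CGTGCCT
```

On the '+' strand both versions agree. On the '-' strand only the second version returns the reverse complement of the region that the forward-strand coordinates describe.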

Has anyone else experienced this? Is my bug fix correct?

sarahjeeeze commented 2 months ago

Hi, thanks for reporting this. We will investigate and amend if required and let you know once implemented.

sarahjeeeze commented 2 months ago

Hi, after some investigation I see that seqkit amplicon, which we use to get this sequence, changed (fixed) the way it reports start and end positions in the BED file as of 2.4.0, and that change resulted in this bug in the workflow. It previously output the start and end points relative to the reverse complement. See https://github.com/shenwei356/seqkit/issues/367. We will amend with your fix. Thanks again for drawing our attention to this!
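For readers wondering how the two coordinate conventions relate, the following is a rough sketch of the general arithmetic, not of seqkit's implementation. It assumes 0-based, half-open intervals on a linear sequence (no wrap-around); the function names are hypothetical.

```python
def reverse_complement(seq):
    """Hypothetical helper (upper-case ACGT only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def revcomp_to_forward_coords(start, end, seq_len):
    """Map a 0-based half-open interval on the reverse complement
    to the equivalent interval on the forward strand."""
    return seq_len - end, seq_len - start


toy_seq = "ACGTTGCAAGGC"  # made-up sequence, for illustration only
rc = reverse_complement(toy_seq)

rc_start, rc_end = 2, 7   # interval expressed on the reverse complement
fwd_start, fwd_end = revcomp_to_forward_coords(rc_start, rc_end, len(toy_seq))

print(rc[rc_start:rc_end])                             # CTTGC
print(fwd_start, fwd_end)                              # 5 10
print(reverse_complement(toy_seq[fwd_start:fwd_end]))  # CTTGC
```

Slicing the reverse complement with reverse-strand coordinates and reverse-complementing a forward-strand slice give the same bases. Mixing the two conventions, as the old code ended up doing after the seqkit change, does not.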

sarahjeeeze commented 3 weeks ago

Hi, we have now released the fix for this in the latest release.

sarahjeeeze commented 2 weeks ago

Closing as this is now fixed. Let us know if you have any further trouble.