BBaloglu / ASHURE

Python-based pipeline for analyzing Nanopore sequencing metabarcoding data
GNU Lesser General Public License v3.0

MSA file generation and primer identification #2

Open IainPerry opened 3 years ago

IainPerry commented 3 years ago

Hi, I'm struggling with two potential issues.

1. The MSA step is producing millions of files, one for each potential fragment. This really hampers performance and takes up a lot of space: 1 GB of Nanopore sequencing data turns into 50 GB of files. I'm not sure whether this is deliberate or something I'm doing wrong.
2. I'm using ARTIC primers, and the search for forward and reverse primers comes back with "error: Fwd and Rev primers were not found in reads". I'm not sure what could be causing this.

BBaloglu commented 3 years ago

Hi @IainPerry, could you please post your primers and share some sample data with us, so we can help test it for you? The MSA step should not amplify the data like that. Could you also give us a bit more detail on your experimental design? For instance, are you working on concatemer data?

Also, how long are your primers? ASHURE uses minimap2 for primer search, which works efficiently for primers longer than 30 bp but fails for primers shorter than 30 bp. That could be the case for you. We are working on integrating other aligners that handle short primer searches better. In the meantime, we suggest that you either combine your primers (please find the detailed description here: https://github.com/BBaloglu/ASHURE/issues/1) and use the combined versions as your forward and reverse primers for primer search, or design longer primers for your experiment.
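A quick way to check whether primer length is the problem is to list every primer under the 30 bp threshold mentioned above. The following is a minimal sketch and not part of ASHURE; the function name `find_short_primers` and the example primer names and sequences are made up for illustration.

```python
def find_short_primers(primers, min_len=30):
    """Return {name: length} for primers shorter than min_len bases.

    `primers` maps a primer name to its sequence. The 30 bp default
    reflects the threshold discussed in this thread; it is not an
    ASHURE setting.
    """
    short = {}
    for name, seq in primers.items():
        seq = seq.strip()
        if len(seq) < min_len:
            short[name] = len(seq)
    return short


# Hypothetical primers for illustration only (not real ARTIC sequences)
example = {
    "fwd_short": "ACGT" * 5 + "AC",  # 22 bp
    "rev_long": "ACGT" * 8 + "AC",   # 34 bp
}
print(find_short_primers(example))
```

Any primer reported here is a candidate for combining or lengthening before running the primer search.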

Best, Bilge

IainPerry commented 3 years ago

Hi Bilge, Thank you for your reply. I'm using the ARTIC V3 primers listed here: https://github.com/artic-network/primer-schemes/blob/master/nCoV-2019/V3/nCoV-2019.tsv They are all under 30 bp, so that seems highly likely to be the cause. For now I'll try increasing the primer lengths using sequence from the amplicon and give that a go. The current primers shouldn't affect detection of covid variants. ARTIC amplifies 400 bp regions across the covid genome, and overlapping amplicons from a second pool then cover the primer regions. Do you think it is better to build two pseudoref libraries (-fs 250-500), or should building with -th 250 work?
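One way to lengthen a primer from the amplicon is to take extra bases from the reference next to the primer site. The sketch below assumes the reference genome is already loaded as a plain string and that primer sites are given as 0-based, end-exclusive coordinates with a strand, as in a BED file. The names `extend_primer` and `reverse_complement` are hypothetical, and whether primers extended this way work well with ASHURE's primer search has not been confirmed in this thread.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extend_primer(reference, start, end, strand, min_len=30):
    """Extend a primer site into its amplicon until it spans min_len bases.

    start/end are 0-based, end-exclusive reference coordinates of the
    primer site. strand is "+" for a forward primer and "-" for a
    reverse primer. Primers already at least min_len long are returned
    unchanged.
    """
    length = max(end - start, min_len)
    if strand == "+":
        # Forward primer: keep the 5' start, grow toward higher coordinates
        new_start, new_end = start, start + length
    elif strand == "-":
        # Reverse primer: keep the outer edge, grow toward lower coordinates
        new_start, new_end = end - length, end
    else:
        raise ValueError("strand must be '+' or '-'")
    if new_start < 0 or new_end > len(reference):
        raise ValueError("extended primer runs off the reference")
    site = reference[new_start:new_end]
    return site if strand == "+" else reverse_complement(site)
```

In both cases the extension grows inward, toward the amplicon, so the outer edge of each primer site stays where it was.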

As for the MSA step, I think the increase in data is a result of each detected fragment being put in its own file, so I end up with a folder with 300 million files in it. The data comes from an attempt to amplify covid variants from wastewater. This makes it considerably longer than a COI gene, but perhaps not unreasonably long for this.
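To put numbers on the file explosion, a small script can count the files in the MSA output folder and sum their sizes. This is a minimal sketch using only the Python standard library; the function name `summarize_dir` is hypothetical, and the folder to pass in depends on where your run writes its MSA files.

```python
import os


def summarize_dir(path):
    """Return (file_count, total_bytes) for regular files directly in path.

    Uses os.scandir so entries are streamed rather than loaded into one
    big list, which matters when a folder holds millions of files.
    Subdirectories are not descended into.
    """
    count = 0
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                count += 1
                total += entry.stat(follow_symlinks=False).st_size
    return count, total
```

Note that the total is the apparent size of the files. With millions of very small files, the space actually used on disk can be higher because each file occupies at least one filesystem block.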

Best Iain