BBaloglu / ASHURE

Python-based pipeline for analyzing Nanopore sequencing metabarcoding data
GNU Lesser General Public License v3.0

Any issue with ASHURE #1

Closed BBaloglu closed 3 years ago

BBaloglu commented 4 years ago

Hi everyone,

If you have any issues with the code, please post them here. We will get back to you as soon as we can. Thanks!

pedroemanuelvieira commented 4 years ago

Dear BBaloglu. This tool seems very promising. I am having some problems, and I am not sure whether they are related to my files or to the pipeline. I have COI_fwd, COI_rev, 18S_fwd, and 18S_rev in the same run.

With my data I am running into a problem: the error in file1 (attached) is "adding ONT header info". Have you encountered this error before? I also tried some data I found from a paper online, and I am having other issues with the database. I tried to use a database (with that data and also with my own), and an error persists (file2, attached). Do you have any suggestions?

Would it be possible to add a fastq file that works here?

Thanks, Pedro

first second

zchen15 commented 4 years ago

Hi Pedro,

Can you provide a few lines of the fastq file so I can try to debug this? It's possible the fastq header from your run is different from our MinION runs. I can fix this once I know what the header line looks like.

Can you also attach the log file generated by the pipeline?

-Zhewei

pedroemanuelvieira commented 4 years ago

Dear Zhewei. Thanks for your quick reply.

I tried on another Linux machine and the provided examples worked (except the clustering step). I think the problem was with some pandas dependencies. Although I now managed to upload the fastq file of my data, I still have some problems. I get error 1 if I use the primers ("INFO: warning no fwd and rev primers found") or error 2 if I use a database (NameError: name 'bpy' is not defined). I tried both from the command line and from Python.

I am pretty sure my reads have the primers.

Thanks in advance for any help you can give me. Pedro

error1.txt error2.txt

zchen15 commented 4 years ago

Hi Pedro,

Based on the log file in error1.txt, I can only guess at the following scenarios:
1) Your primer sequences are too short, and minimap2 was not able to find any confident matches.
2) Your primers were not provided in the correct orientation.

Based on the log file in error2.txt, it looks like bilge_pype.py was not imported properly by the script. Is bilge_pype.py in the same directory as ashure.py?

I cannot tell exactly what's wrong with the code without the relevant primers.csv, ref_db.fa, or FAL_2.fq files.

Is it possible to share these via a Colab notebook or as an attachment, so I can replicate the error myself?

pedroemanuelvieira commented 4 years ago

Dear Zhewei. Thanks for your feedback and for your support. We used mixed COI and 18S primers.

I tried again with the whole primer sequence and the error persists. It occurs at the minimap2 step. I am surely missing something.

Here is the subsample I tried. https://drive.google.com/file/d/1VaucyKs-8BM2FS5X7wdHlarLwI9buqRo/view?usp=sharing

The primer: primers3.csv.gz

Once again, thanks for your help Pedro

PS: When I use the original header order of my fastq file, it does not work, so I changed it to match the fastq files you used. Example of the original header: @b0174f2c-8acf-41eb-ae65-55bbeb5b4d96 runid=f4d58d5083109061d5f20bebfbcfa553f28ca79b read=7 ch=269 start_time=2019-12-20T14:22:02Z flow_cell_id=FAL23141 protocol_group_id=jg_20_12_2019 sample_id=jg_20_12_2019 barcode=barcode13
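For reference, the fields after the read ID in such an ONT header are plain key=value pairs, so their order should not carry extra information. A minimal sketch of splitting the header shown above into its fields (illustrative only, not the parsing code ASHURE uses):

```python
# Illustration only: split an ONT-style fastq header into its key=value fields.
# The header string below is the example from this thread.
header = ("@b0174f2c-8acf-41eb-ae65-55bbeb5b4d96 runid=f4d58d5083109061d5f20bebfbcfa553f28ca79b "
          "read=7 ch=269 start_time=2019-12-20T14:22:02Z flow_cell_id=FAL23141 "
          "protocol_group_id=jg_20_12_2019 sample_id=jg_20_12_2019 barcode=barcode13")

tokens = header.split()
read_id = tokens[0].lstrip('@')                      # first token is the read UUID
fields = dict(t.split('=', 1) for t in tokens[1:])   # remaining tokens are key=value pairs

print(read_id, fields['flow_cell_id'], fields.get('barcode'))
```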

zchen15 commented 4 years ago

Hi Pedro,

For error1, it seems your 18S and COI primers are correct, but the primers may be too short to be found by minimap2. The primers we used in our paper were ~60 bp long with barcode tags to better distinguish them from native sequences. I fixed a few lines in the code to make this error clearer.

For error2, I looked at the sample data and fixed a bug in the code. The load_file function in bilge_pype.py was referencing its own functions incorrectly when trying to load the fasta file as the reference database.

The following command now runs correctly:

ashure run -fq fastq/*.fastq -db ref_db.fa -o1 cons.csv -r fgs

If you find that the ref_db.fa in the demo folder is insufficient, I suggest trying the de novo clustering function on a subset of your data to generate a good reference database. Run git pull to get the latest changes; this will enable you to run clustering directly on your fastq file:

ashure clst -i FAL_2.fastq -iter 3 -r

Note this may be a bit risky, since the clustering code was not meant for potentially concatemeric data. The clustering code is also sensitive to the orientation of the gene, so you might end up with more clusters than you expect.

Finally, we updated our guide on pseudo reference generation. You might find the section with the t-SNE plots helpful for interpreting your dataset. https://bbaloglu.github.io/pages/ashure/pseudo_reference.html

Let me know how it goes!

pedroemanuelvieira commented 4 years ago

Thanks, Zhewei, for your support. It worked, although it still doesn't recognize the primers. All the best, Pedro

BBaloglu commented 4 years ago

Hi @pedroemanuelvieira, you can access the open-access dataset here, in case you would like to try it out with a larger portion of our fastq data: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA627498

Thanks for your feedback so far.

MaxRubinBlum commented 3 years ago

Dear BBaloglu and Zhewei,

Is there a way around the short-primer issue, such as tweaking minimap2 parameters?

I am getting an error similar to the one described above:

ERROR: failed to open file './pseudo_refdb/results.paf'
[M::main] Version: 2.12-r827
[M::main] CMD: minimap2 -c -k5 -w1 -s 20 -P -o ./pseudo_refdb/index.mmi ./pseudo_refdb/read1.fq ./pseudo_refdb/results.paf
[M::main] Real time: 0.252 sec; CPU: 0.682 sec
Traceback (most recent call last):
  File "/home/bioinf/bin/ashure/src/ashure.py", line 1127, in <module>
    main()
  File "/home/bioinf/bin/ashure/src/ashure.py", line 1018, in main
    ref = generate_pseudo_refdb(primers, reads, block=10000, fs=config['prfg_fs'], config=config['prfg_config'])
  File "/home/bioinf/bin/ashure/src/ashure.py", line 68, in generate_pseudo_refdb
    out = match_primer_to_read(primers, df_i, workspace=workspace, config=config)
  File "/home/bioinf/bin/ashure/src/ashure.py", line 306, in match_primer_to_read
    df1 = bpy.run_minimap2(reads, database[['id','sequence']], workspace, config, build_index=True, use_index=True)
  File "/home/bioinf/bin/ashure/src/bilge_pype.py", line 669, in run_minimap2
    data = parse_PAF(outfile)
  File "/home/bioinf/bin/ashure/src/bilge_pype.py", line 567, in parse_PAF
    with open(fname,'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: './pseudo_refdb/results.paf'

We used short primers, such as 27F (AGAGTTTGATCMTGGCTCAG) and 1492R (GGTTACCTTGTTACGACTT) for 16S rRNA amplicons, and 566F (CAGCAGCCGCGGTAATTCC) and 1289R (ACTAAGAACGGCCATGCACC) for 18S rRNA amplicons.

Many thanks for your contribution and help. Maxim

zchen15 commented 3 years ago

Hey Maxim,

We haven't found other parameters that might work with short primers on minimap2.

The following two suggestions could work for you:

1) Combine the fwd and rev primers you used and use the combined sequences as your fwd and rev primers (see the sketch after these suggestions):
27F+1492R-fwd = AAGTCGTAACAAGGTAACC AGAGTTTGATCMTGGCTCAG
27F+1492R-rev = CTGAGCCAKGATCAAACTCT GGTTACCTTGTTACGACTT
Just be mindful of the degenerate base characters you may be using when obtaining the reverse complement sequence.

2) If you know the identity of the gene or related genes you are sequencing, you could try augmenting the primer sequences by adding the first 30-40 bp and last 30-40 bp of the gene to your fwd and rev primer sequences. There is a small risk you may lose some species, but you can look back at the reads in which concatemers were not detected, BLAST some of them, and see if anything new comes up. If you see something new, add it to the pseudo reference or augment the primer sequences again, and rerun the pipeline on the subset of reads for which you did not get any hits.
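To illustrate suggestion 1, here is a minimal Python sketch (not part of ASHURE) that builds the combined fwd/rev sequences from the 27F/1492R primers above, using an IUPAC-aware reverse complement so degenerate bases such as M map to K:

```python
# Illustration only: build combined fwd/rev primer sequences as in suggestion 1.
# IUPAC-aware complement table so degenerate bases (e.g. M -> K) are handled.
COMP = str.maketrans('ACGTMRWSYKVHDBN', 'TGCAKYWSRMBDHVN')

def revcomp(seq: str) -> str:
    # complement every base, then reverse the string
    return seq.upper().translate(COMP)[::-1]

fwd_27F   = 'AGAGTTTGATCMTGGCTCAG'
rev_1492R = 'GGTTACCTTGTTACGACTT'

combined_fwd = revcomp(rev_1492R) + fwd_27F    # AAGTCGTAACAAGGTAACC + AGAGTTTGATCMTGGCTCAG
combined_rev = revcomp(fwd_27F) + rev_1492R    # CTGAGCCAKGATCAAACTCT + GGTTACCTTGTTACGACTT

print('27F+1492R-fwd =', combined_fwd)
print('27F+1492R-rev =', combined_rev)
```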

This pipeline uses pseudo reference sequences to find the concatemers, and it was designed this way so the user can leverage what they know about the gene they are sequencing in order to process their NGS data.

Hope it helps!

MaxRubinBlum commented 3 years ago

Zhewei, many thanks, this is very helpful! Will give it a try.

KT0710 commented 3 years ago

Dear Zhewei and Baloglu,

I've installed ASHURE and have been trying to run the demo, but the fastq file in the demo folder seems damaged, and ASHURE returns an error. The fastq file's type property is shown as:
Link (broken) (inode/symlink)
and the returned error is:
FileNotFoundError: [Errno 2] No such file or directory: '/home/kohsuke/Documents/nanopore/yoshida/ashure/demo/fastq'

I'm a beginner in bioinformatics, and I apologize beforehand if this is due to any error on my side.

Kohsuke

zchen15 commented 3 years ago

Hi Kohsuke,

I recently moved the data files for the demo to an external repository and compressed them to bz2 format to save space.

The data files can be added back to the main repository as git submodules using the following command:

git submodule update --init --recursive

This will pull minimap2, spoa, and the demo data files into the main repository. The symlink should then work, and it should point to ../external/data/COI/

I have not updated the Jupyter notebooks yet, so you may need to update the code to point to the new *.fastq.bz2 files. The code has recently been updated so it can open bz2-compressed fastq files.
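For reference, reading a bz2-compressed fastq is straightforward with the Python standard library. A minimal sketch (illustrative only; the filename is a placeholder and this is not the exact routine ASHURE uses):

```python
# Illustration only: iterate over records in a bz2-compressed fastq file.
# 'reads.fastq.bz2' is a placeholder filename.
import bz2
from itertools import islice

with bz2.open('reads.fastq.bz2', 'rt') as f:
    while True:
        record = list(islice(f, 4))           # fastq records are 4 lines each
        if len(record) < 4:
            break
        header, seq, plus, qual = (line.rstrip('\n') for line in record)
        print(header.split()[0], len(seq))    # read id and read length
```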

Let us know if you see any more issues.

zchen15 commented 3 years ago

Also, if you did not install minimap2 or spoa to your root directory, I added options so you can do:

python3 ashure.py -minimap2 ../bin/minimap2 -spoa ../bin/spoa run -h

This will let you point to these executables if you did not install them.

KT0710 commented 3 years ago

Dear Zhewei,

Thank you for your helpful response. I've successfully run the demo and have proceeded to running my own analyses (by the way, the options for pointing to minimap2 and spoa were very convenient). However, I am getting the error below when I run it:

pid[241765] 2020-10-27 18:27:32.555 INFO: warning no fwd and rev primers found
pid[241765] 2020-10-27 18:27:32.555 ERROR: generate_pseudo_refdb: no primers found in reads

I am sure my reads have primers in them, since I checked manually by searching the fastq file text. If you could give me any insight into how this might be happening, I would be very grateful.

Kohsuke

BBaloglu commented 3 years ago

Hi Kohsuke @KT0710, our pipeline ASHURE uses minimap2 for the primer search, which works efficiently only for primers longer than 30 bp; minimap2 fails when searching for primers shorter than 30 bp. We are working on integrating other aligners that handle short-primer search better. In the meantime, we suggest that you combine your primers (please see the detailed suggestion above) and use the combined version as your forward and reverse primers for the primer search, or design longer primers for your experiment.

KT0710 commented 3 years ago

Dear BBaloglu,

Thank you for your response. We do use primers that are approximately 20 bp in length, and we will try the workarounds suggested above. I'm sorry I did not notice the answer above beforehand. (But I do hope shorter primers can be used in future versions of ASHURE, since metabarcoding primers for wider species groups tend to be short.)

BBaloglu commented 3 years ago


@KT0710 Agreed! For now, the aligner parasail seems promising for mapping short primers; we are working on it. This project is no longer supported financially, so our pace may be slow. Stay tuned and subscribe to ASHURE for more updates.

Many thanks for your feedback. Bilge

BBaloglu commented 3 years ago

Hi @KT0710, we hope you had a chance to try our suggestion. I will go ahead and close this issue if that is the case. Feel free to open a new one for your questions on a specific task.

KT0710 commented 3 years ago

Hi @BBaloglu , I'm sorry I forgot to report back. I've tried your workaround, and it worked out wonderfully. I have no problem with your closing this issue. Thank you again!

BBaloglu commented 3 years ago

Thank you @KT0710, great to hear that our solutions were useful for your work. Let us know if you come across any other issues.

Cheers, Bilge