asrivathsan / miniBarcoder

DNA barcoding using MinION

Installing miniBarcoder as part of a Conda environment? #1

Closed Thomieh73 closed 5 years ago

Thomieh73 commented 5 years ago

Hi, since my colleagues and I are working on a compute cluster, it is convenient to set up the miniBarcoder pipeline within a virtual environment. We have Anaconda3 installed, so I decided to install miniBarcoder in a conda environment with Python 2.7.

Within the conda environment I could then install all the dependencies, and I could call the different programs, such as GraphMap and Racon, without getting errors due to missing dependencies.
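
For reference, I set the environment up roughly like this (a sketch from memory; the exact package list and channels may differ, these are the bioconda names I believe I used):

    # create a Python 2.7 environment for the pipeline
    conda create -n minibarcoder python=2.7
    source activate minibarcoder   # 'conda activate minibarcoder' on newer conda

    # external tools from bioconda (package names assumed; check availability)
    conda install -c bioconda -c conda-forge graphmap racon mafft blast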

As the final step, I downloaded the master.zip file from the miniBarcoder repository and wanted to install it within the conda environment. I added the scripts to the bin directory of the environment (part of $PATH) and tried calling the scripts. That did not work immediately.

It only worked by calling the scripts directly via the path to the bin directory, or by adding a shebang to the Python scripts:

#!/usr/bin/env python

That allowed me to call the scripts without needing to write the path to their storage location.
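
Concretely, each script just needs the interpreter line prepended and the executable bit set, something like this (assuming GNU sed, an activated environment, and miniBarcoder.py as the example file):

    # prepend the shebang so the environment's python is used, then make it executable
    sed -i '1i #!/usr/bin/env python' $CONDA_PREFIX/bin/miniBarcoder.py
    chmod +x $CONDA_PREFIX/bin/miniBarcoder.py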

However, I am uncertain what to do with the scripts folder. Do the scripts in there need to be accessible to the main miniBarcoder pipeline scripts, or not?

asrivathsan commented 5 years ago

Hi,

The main miniBarcoder.py script doesn't rely on other scripts. aacorrection.py needs correction_funcs.py. The scripts folder is required if you want to run the Racon pipeline, or to assess barcodes, filter by ambiguity counts, or consolidate MAFFT/Racon barcodes.

Thanks for sharing your experience with the conda environment, this is useful to know. I will take a look at it on this end and update the pipeline to run with it.

Thomieh73 commented 5 years ago

Great, happy to read that you like the conda set-up idea. Thanks for pointing out which scripts need the code in the scripts folder. That is good to know.

I will use the test data to check the installation I have so far and modify it if needed. That should not be too difficult.

Thomieh73 commented 5 years ago

Hi, I have been trying to get racon_consensus.sh to work with DatasetA. I have run miniBarcoder.py, which created an output folder named Data_A_output.

I thought that when I load the conda environment where the miniBarcoder tool is installed, I could add the scripts folder to the PYTHONPATH. I had copied the scripts folder to the bin directory of the environment, so my command to add it to the PYTHONPATH was the following:

export PYTHONPATH=$PYTHONPATH:/work/projects/nn9305k/src/anaconda3/envs/minibarcoder/bin

When I then run the racon_consensus.sh script from within my data folder, with this command:

racon_consensus.sh DatasetA/A_N055.fastq DatasetA/A_N055.fasta Data_A_output Data_A_output/all_barcodes.fa Data_A_consensus

I get the following output:

python: can't open file 'scripts/get_fastqs.py': [Errno 2] No such file or directory
python: can't open file 'scripts/split_fasta_to_each.py': [Errno 2] No such file or directory
Data_A_consensus Data_A_output DatasetA DatasetA.tgz
Data_A_consensus Data_A_output DatasetA DatasetA.tgz
Reference does not exist: 'Data_A_consensus/A_N055.fasta_for_graphmap/*'

For detailed help, please run with -h option.

Example usage: ./graphmap align -r escherichia_coli.fa -d reads.fastq -o alignments.sam

GraphMap (c) by Ivan Sovic, Mile Sikic and Niranjan Nagarajan GraphMap is licensed under The MIT License.

Version: v0.5.2 Build date: Aug 6 2018 at 00:05:06

racon: unrecognized option '--sam'
cat: 'Data_A_consensus/A_N055.fasta_dem_fastqs/*racon.fasta': No such file or directory

My next idea was to drop the "scripts" folder into my data directory. It is not so nice, but it works for now. It is, however, not a good solution when analyzing multiple datasets.

Any ideas on what to pursue here? Thomas

asrivathsan commented 5 years ago

Hi, I am looking into this, will get back soon.

asrivathsan commented 5 years ago

Hi,

I have updated racon_consensus.sh to find the path of the scripts folder. I think the current scripts directory is in /work/projects/nn9305k/src/anaconda3/envs/minibarcoder/bin, which is where racon_consensus.sh is as well? As long as racon_consensus.sh is in PATH, this should work... Could you test what happens if you replace the old racon_consensus.sh script with the new one and run it?
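
The idea is that the script resolves its own location and calls the helper scripts relative to that, instead of expecting a scripts/ folder in the working directory. A minimal sketch of that pattern (not necessarily the exact code in racon_consensus.sh, and the relative location of the helpers depends on how the folder was copied):

    # resolve the directory this script lives in, then call helpers relative to it
    SCRIPTDIR="$(cd "$(dirname "$0")" && pwd)"
    python "$SCRIPTDIR/scripts/get_fastqs.py" ...   # arguments omitted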

Thomieh73 commented 5 years ago

Okay, I have copied the file and I have now tested it on the command line, and it does not give error messages. It starts running GraphMap. I will now rerun my Slurm script and test it on dataset A. Will let you know if that gives the correct output.
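
For context, the Slurm script is essentially just activating the environment and calling the pipeline; a stripped-down sketch (the account, resources, and activation method are placeholders for our cluster setup):

    #!/bin/bash
    #SBATCH --job-name=minibarcoder_A
    #SBATCH --account=nn9305k
    #SBATCH --ntasks=16
    #SBATCH --time=02:00:00

    source activate minibarcoder
    racon_consensus.sh DatasetA/A_N055.fastq DatasetA/A_N055.fasta \
        Data_A_output Data_A_output/all_barcodes.fa Data_A_consensus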

Thomieh73 commented 5 years ago

Hi, I have run the racon_consensus.sh script. It ran and I got quite a bit of screen output. In the end, however, it did crash.

This was the command I used

racon_consensus.sh DatasetA/A_N055.fastq DatasetA/A_N055.fasta Data_A_output Data_A_output/all_barcodes.fa Data_A_consensus

And this is the representative output for one of the input files loaded by the script:


O89887_MOD06_all.fa
O89887_MOD06_all
[20:00:03 BuildIndexes] Loading reference sequences.
[20:00:03 SetupIndex_] Building the index for shape: '11110111101111'.
[20:00:03 Create] Allocated memory for a list of 328 seeds (128 bits each) (0.00000 sec, diff: 0.00000 sec).
[20:00:03 Create] Memory consumption: [currentRSS = 1 MB, peakRSS = 1 MB]
[20:00:03 Create] Collecting seeds.
[20:00:03 Create] Minimizer seeds will be used. Minimizer window is 5.
[20:00:03 Create] [currentRSS = 1 MB, peakRSS = 1 MB] Sequence: 2/2, len: 654, name: 'O89887_MOD06_all;654;131'
[20:00:03 Create] Final memory allocation after collecting seeds: [currentRSS = 1 MB, peakRSS = 1 MB]
[20:00:03 Create] Sorting the seeds using 16 threads.
[20:00:03 Create] Generating the hash table.
[20:00:03 Create] Calculating the distribution statistics for key counts.
[20:00:03 Create] Index statistics: average key count = 1.000000, max key count = 1.000000, std dev = 0.000000, percentil (99.00%) (count cutoff) = 2.000000
[20:00:03 Create] Memory consumption: [currentRSS = 1 MB, peakRSS = 1 MB]
[20:00:03 SetupIndex_] Finished building index.
[20:00:03 SetupIndex_] Storing the index to file: 'Data_A_consensus/A_N055.fasta_for_graphmap/O89887_MOD06_all.fa.gmidx'.
[20:00:03 Index] Memory consumption: [currentRSS = 1 MB, peakRSS = 1 MB]

[20:00:03 Run] Hits will be thresholded at the percentil value (percentil: 99.000000%, frequency: 2).
[20:00:03 Run] Minimizers will be used. Minimizer window length: 5
[20:00:03 Run] Automatically setting the maximum allowed number of regions: max. 500, attempt to reduce after 0
[20:00:03 Run] Reference genome is assumed to be linear.
[20:00:03 Run] Only one alignment will be reported per mapped read.
[20:00:03 ProcessReads] Reads will be loaded in batches of up to 1024 MB in size.
[20:00:03 ProcessReads] Batch of 131 reads (0 MiB) loaded in 0.00 sec. (11829512 bases)
[20:00:03 ProcessReads] Memory consumption: [currentRSS = 2 MB, peakRSS = 2 MB]
[20:00:03 ProcessReads] Using 16 threads.
[20:00:03 ProcessReads] [CPU time: 0.00 sec, RSS: 2 MB] Read: 0/131 (0.00%) [m: 0, u: 0], length = 664, qname: 59e1c44d-4cb7-4be1-9c69-b2ba82c0c1237a146e42-3e
[20:00:03 ProcessReads] [CPU time: 0.23 sec, RSS: 5 MB] Read: 19/131 (14.50%) [m: 9, u: 4], length = 702, qname: 76f18db3-cb89-404e-88d2-b3d20017e0835e9e55be-
[20:00:03 ProcessReads] [CPU time: 0.45 sec, RSS: 5 MB] Read: 47/131 (35.88%) [m: 13, u: 19], length = 877, qname: 59987cdb-f6b5-4dac-8cbd-f447fa5ccfceb43a8de
[20:00:03 ProcessReads] [CPU time: 0.71 sec, RSS: 5 MB] Read: 62/131 (47.33%) [m: 18, u: 29], length = 662, qname: 838606b5-ef13-4a91-8744-899d668bc34b79ae452
[20:00:03 ProcessReads] [CPU time: 0.90 sec, RSS: 5 MB] Read: 78/131 (59.54%) [m: 24, u: 39], length = 642, qname: 82f97307-ae52-4baf-a4d6-62a41c1825e38cd8664
[20:00:03 ProcessReads] [CPU time: 1.09 sec, RSS: 5 MB] Read: 93/131 (70.99%) [m: 29, u: 49], length = 713, qname: 429828e4-db36-413c-8131-e544814ae6699b8b80b
[20:00:03 ProcessReads] [CPU time: 1.32 sec, RSS: 5 MB] Read: 110/131 (83.97%) [m: 35, u: 60], length = 665, qname: 581d190b-3c54-4c36-a5e3-73a3f0df150bbc47f0
[20:00:03 ProcessReads] [CPU time: 1.51 sec, RSS: 5 MB] Read: 125/131 (95.42%) [m: 41, u: 69], length = 662, qname: 0145dd1a-a0f8-4946-aea9-b704c330954ceefd86
[20:00:03 ProcessReads] [CPU time: 1.71 sec, RSS: 5 MB] Read: 131/131 (100.00%) [m: 51, u: 80]

[20:00:03 ProcessReads] Memory consumption: [currentRSS = 5 MB, peakRSS = 5 MB]

[20:00:03 ProcessReads] All reads processed in 1.73 sec (or 0.03 CPU min).
racon: unrecognized option '--sam'
cat: Data_A_consensus/A_N055.fasta_dem_fastqs/*racon.fasta: No such file or directory

For a different barcode it always crashes at the Racon step...

I checked that the conda environment has Racon installed. On the command line I can type racon, and this is the head of the output:

thhaverk@login-0-0 test$ racon
[racon::] error: missing input file(s)!
usage: racon [options ...] <sequences> <overlaps> <target sequences>

    <sequences>
        input file in FASTA/FASTQ format (can be compressed with gzip)
        containing sequences used for correction
    <overlaps>
        input file in MHAP/PAF/SAM format (can be compressed with gzip)
        containing overlaps between sequences and target sequences
    <target sequences>
        input file in FASTA/FASTQ format (can be compressed with gzip)
        containing sequences which will be corrected

    options:
        -u, --include-unpolished
            output unpolished target sequences

So what is going wrong here?

Thanks, by the way, for modifying your script. That solved the problem of it not finding the scripts folder.

asrivathsan commented 5 years ago

Hi Thomas, the issue is with the Racon version: Racon was upgraded with significant changes, the --sam option no longer works, and the way input files have to be specified has changed. I have edited the script for that and tested it here with the recent Racon version... it seems to work now, i.e. if you download the script again.
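
As far as I can tell from the usage text above, the new Racon takes the three inputs as positional arguments (overlaps in MHAP/PAF/SAM) and writes the polished sequences to stdout, so the call inside the script now has to look roughly like this (file names are placeholders):

    # new-style Racon (1.x): no --sam flag, polished sequences go to stdout
    racon reads.fastq alignments.sam barcode_draft.fasta > barcode_racon.fasta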

I had some issues with Racon just now; even though it compiled smoothly, it gave me some errors. In case you encounter an error like this https://github.com/isovic/racon/issues/65, it may require recompilation.

Thomieh73 commented 5 years ago

Hi, I have updated the script again. It worked. I get a fasta file with the Racon barcodes. One thing I notice is that it outputs the file with barcodes in the main directory instead of the directory that was indicated as the output directory. I get a folder called Data_A_racon_out and a file called Data_A_racon_out.fa with the barcodes.

For now that is not a problem for me, since I can just adjust the commands to do the next step with aacorrection.py

At this point I am able to run miniBarcoder to generate the MAFFT barcodes, and I can generate the Racon barcodes. The next step is to do the error correction of both barcode sets against the NT reference database and then merge the results with the consolidate script.

I am a little confused about the order of the scripts, so let me explain how I understand the pipeline (a rough command sketch follows the list): Workflow steps

  1. Create barcode sequences using MAFFT (miniBarcoder.py).
  2. Create barcode sequences using Racon (racon_consensus.sh).
  3. AA error correction of both the MAFFT and Racon barcodes using BLAST against the NT database (aacorrection.py).
  4. Combine the error-corrected MAFFT and Racon barcodes into the final barcode sequences (consolidate.py).
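
To make that concrete, I plan to chain the commands roughly like this; only the racon_consensus.sh arguments are the ones I actually used above, while the miniBarcoder.py, aacorrection.py and consolidate.py options are placeholders I still need to check against their help output:

    # step 1: MAFFT barcodes (options omitted)
    python miniBarcoder.py ...
    # step 2: Racon barcodes
    racon_consensus.sh DatasetA/A_N055.fastq DatasetA/A_N055.fasta \
        Data_A_output Data_A_output/all_barcodes.fa Data_A_consensus
    # step 3: AA correction of each barcode set via BLAST against NT
    python aacorrection.py ...
    # step 4: consolidate the corrected MAFFT and Racon barcodes
    python consolidate.py ...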

After the last step I will have error-corrected barcode sequences from any experiment that I have processed, right? The other scripts that you created are for inspecting the error stats and other issues that could be of interest, right?

When this all fits, I think I have my environment set up so others at my institute can use it as well. At that point, I can give you an overview of how I set up the conda environment for this tool.

asrivathsan commented 5 years ago

Hi Thomas, yup! We did AA correction on the barcodes obtained from (1) and (2) and consolidated them. And yes, the rest are just meant for comparison and quality assessment.

Thanks, it would be nice to hear about your experience.

asrivathsan commented 5 years ago

Hi Thomas,

We have updated the pipeline considerably (faster, parallel, higher demultiplexing rate) (https://www.biorxiv.org/content/biorxiv/early/2019/04/29/622365) and, as suggested, have also made it compatible with conda. I will look to put up a wrapper for the various steps soon, but meanwhile, I hope this is useful.