PhyloGrok / VCFgenerator

Automated variant calling app for NextGen evolutionary genomics

GNU General Public License v3.0


Variant calling workflow - script automation #3

Open PhyloGrok opened 1 year ago

PhyloGrok commented 1 year ago

Basically 3 parts; each part should have its own shell script file:

1. Retrieving SRA files (.fastq format) using EDirect and SRA-toolkit.
2. Quality control: trimmomatic and fastqc.
3. Assembly and variant calling: bwa, samtools.

PhyloGrok commented 1 year ago

The .fastq sequence retrieval part now looks robust. Storage may become an issue; we will likely need to request additional storage for larger datasets. The .sam, .bam, and .bcf files are also very large, so we'll need to delete them dynamically after processing to conserve storage.

Currently waiting for Lloyd's Python scripts, which will take user inputs and control the execution of the workflow steps.
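A minimal sketch of how the intermediate files could be removed once a sample has been processed. The function name, the directory layout, and the assumption that file names start with the sample ID followed by "." or "_" are hypothetical, not taken from the project scripts.

```python
from pathlib import Path

# Assumed set of intermediate extensions, taken from the comment above.
INTERMEDIATE_SUFFIXES = {".sam", ".bam", ".bcf"}


def remove_intermediates(results_dir, sample_id):
    """Delete intermediate files for one sample and return the bytes freed.

    Assumes file names start with the sample ID followed by '.' or '_',
    for example SRR0000001.aligned.sam or SRR0000001_raw.bcf.
    """
    freed = 0
    # Materialize the matches first so we do not delete while iterating.
    matches = list(Path(results_dir).rglob(f"{sample_id}[._]*"))
    for path in matches:
        if path.is_file() and path.suffix in INTERMEDIATE_SUFFIXES:
            freed += path.stat().st_size
            path.unlink()
    return freed
```

Calling this only after the final .vcf for a sample has been written would keep the results while releasing the space used by the larger files.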

PhyloGrok commented 1 year ago

This Stack Overflow question has some initial ideas; they use the Python function subprocess.call(): https://stackoverflow.com/questions/32085956/pass-a-variable-from-python-to-shell-script

PhyloGrok commented 1 year ago

Here's a different one where they seem to pass the output of test.py to a shell variable. https://stackoverflow.com/questions/2796932/how-do-i-pass-a-python-variable-to-bash

PhyloGrok commented 1 year ago

This one has a simple example of how the script should run, where Python variables are passed to the bash shell. https://unix.stackexchange.com/questions/466190/passing-python-variable-to-embedded-shell-script
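For reference, a small sketch of the idea in these links: Python values are handed to a shell command as positional arguments, which the shell sees as $1, $2, and so on. The inline `sh -c` snippet only stands in for a real script file, and the function name and example values are made up for illustration.

```python
import subprocess


def run_step(command, *args):
    """Run a command with extra positional arguments and return its stdout."""
    result = subprocess.run(
        [*command, *args],
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )
    return result.stdout


# Inline shell snippet standing in for a workflow script.
# With `sh -c`, the word after the snippet becomes $0, then $1, $2, ...
DEMO_STEP = ["sh", "-c", 'echo "accession=$1 outdir=$2"', "demo"]
```

Usage: `run_step(DEMO_STEP, "SRR0000001", "out")` returns the echoed line. Passing a list of arguments rather than one string with shell=True avoids quoting problems when values contain spaces.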

LloydJonesIII commented 1 year ago

New error found:

./R03.sh: line 21: datasets: command not found

Will continue to work on this over the weekend when I can.

LloydJonesIII commented 1 year ago

Completed the first working variant of the Python hub.

Works with:
esearch
prefetch
fasterq-dump

Working on the trimmomatic step so that multiple instances run in parallel and finish faster.

LloydJonesIII commented 1 year ago

Made major headway on the Python split function, which processes our SRA lists so that multiple trimmomatic instances can run in parallel.
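One way such a split could look, as a sketch only: the accession list is dealt out round-robin into a fixed number of sublists, one per trimmomatic instance. The function name and the round-robin choice are assumptions; the actual split.py may work differently.

```python
def split_accessions(accessions, n_parts):
    """Deal the accessions round-robin into n_parts lists.

    Round-robin keeps the parts within one item of each other in size.
    Parts may be empty when there are fewer accessions than parts.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    return [accessions[i::n_parts] for i in range(n_parts)]
```

Each resulting sublist could then be written to its own text file and handed to a separate trimmomatic shell script.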

LloydJonesIII commented 1 year ago

To test the Python hub code trial3.py, you need the following files located in the same directory:

R03.sh
Split.py
trimmer2
trimmer3

trimmer2 and trimmer3 are the trimmomatic shell scripts designed for split.py. For split.py to work, trimmer2 and trimmer3 need to be in the same directory as split.py.

LloydJonesIII commented 1 year ago

Current list of resources used (links):

https://stackoverflow.com/questions/48209410/cant-open-sh-file
https://stackoverflow.com/questions/32085956/pass-a-variable-from-python-to-shell-script
https://stackoverflow.com/questions/65153137/multiple-inputs-using-subprocess-run-in-python-3-7
https://stackoverflow.com/questions/17742789/running-multiple-bash-commands-with-subprocess
https://stackoverflow.com/questions/7585435/best-way-to-convert-string-to-bytes-in-python-3
https://stackoverflow.com/questions/3172470/actual-meaning-of-shell-true-in-subprocess#:~:text=Setting%20the%20shell%20argument%20to,before%20the%20command%20is%20run.
https://unix.stackexchange.com/questions/242334/notepad-adds-r-to-shell-scripts
https://support.nesi.org.nz/hc/en-gb/articles/218032857-Converting-from-Windows-style-to-UNIX-style-line-endings#:~:text=Converting%20using%20Notepad%2B%2B&text=To%20write%20your%20file%20in,with%20UNIX%2Dstyle%20line%20endings.
https://stackoverflow.com/questions/31786287/how-to-split-large-text-file-in-windows
https://www.tutorialspoint.com/How-to-read-a-file-from-command-line-using-Python#:~:text=Reading%20a%20file%20from%20command,file%20and%20read%20its%20contents.
https://stackoverflow.com/questions/17255737/importing-variables-from-another-file
https://www.pythonforbeginners.com/files/the-fastest-way-to-split-a-text-file-using-python

LloydJonesIII commented 1 year ago

The Python split.py concept has been tested and confirmed to work with the trimmomatic command-line shell scripts.

The parallel trimmomatic shell script has been proven to work.

LloydJonesIII commented 1 year ago

To get the hub Python script to work as intended, a higher-level Python script has been added to the script order so that the user is not constantly prompted for inputs, which is not what we wanted. The new script is called:

Controller.py
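A rough sketch of the controller idea, not the actual Controller.py: all user inputs are collected once up front and then turned into the argument lists for the downstream scripts. The option names (--query, --workdir, --threads) and the arguments passed to R03.sh and Split.py are hypothetical.

```python
import argparse


def parse_inputs(argv=None):
    """Collect every user input once so later steps never need to prompt."""
    parser = argparse.ArgumentParser(description="Collect workflow inputs once.")
    parser.add_argument("--query", required=True, help="search term (hypothetical option)")
    parser.add_argument("--workdir", required=True, help="project directory (hypothetical option)")
    parser.add_argument("--threads", type=int, default=1, help="parallel instances (hypothetical option)")
    return parser.parse_args(argv)


def build_step_commands(cfg):
    """Return one argument list per downstream step.

    The argument order for each script is an assumption for illustration.
    """
    return [
        ["bash", "R03.sh", cfg.query, cfg.workdir],
        ["python", "Split.py", cfg.workdir, str(cfg.threads)],
    ]
```

Each list returned by `build_step_commands` can be passed to `subprocess.run` in order, so the user is asked for inputs only at the start.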

LloydJonesIII commented 1 year ago

Finished creating and testing the head Python controller, which will control all downstream shell and Python scripts.

LloydJonesIII commented 1 year ago

code error found

indexing error

LloydJonesIII commented 1 year ago

Troubleshooting the variant calling automation step; currently stuck on this error set:

[bwa_index] Pack FASTA... 0.01 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.39 seconds elapse.
[bwa_index] Update BWT... 0.01 sec
[bwa_index] Pack forward-only FASTA... 0.01 sec
[bwa_index] Construct SA from BWT and Occ... 0.16 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index ../../media/volume/sdb/attempt7/assembly/reference/ref_genome
[main] Real time: 0.617 sec; CPU: 0.574 sec
SRR9025102 Variant calling process has begun
[M::bwa_idx_load_from_disk] read 0 ALT contigs
'.::main_mem] fail to open file `

LloydJonesIII commented 1 year ago

Work completed Friday 07-07-23:

Looked into tkinter for GUI integration
Generated first iterations of the variant calling automation code

Work completed Sunday 07-09-23:

Troubleshot the code made on Friday
Made a new iteration but did not yet test it on Sunday

LloydJonesIII commented 1 year ago

New error found that needs to be worked through:

[main] CMD: bwa mem ../../media/volume/sdb/attempt11/assembly/reference/ref_genome.fasta ../../media/volume/sdb/attempt11/fastq/trimmed/SRR9025118_1.trim.fastq.gz ../../media/volume/sdb/attempt11/fastq/trimmed/SRR9025118_2.trim.fastq.gz
[main] Real time: 25.855 sec; CPU: 26.814 sec
[E::hts_open_format] Failed to open file ../../media/volume/sdb/attempt11/assembly/results/sam/SRR9025118.aligned.sam
samtools view: failed to open "../../media/volume/sdb/attempt11/assembly/results/sam/SRR9025118.aligned.sam" for reading: No such file or directory
[E::fai_build3_core] Failed to open the file ../../media/volume/sdb/attempt11/assembly/reference/ref_genome
[E::hts_open_format] Failed to open file ../../media/volume/sdb/attempt11/assembly/results/bcf/SRR9025118_raw.bcf : No such file or directoryvolume/sdb/attempt11/assembly/results/bcf/SRR9025118_raw.bcf
Can't open ../../media/volume/sdb/attempt11/assembly/results/vcf/SRR9025118.vcf: No such file or directory at /home/exouser/anaconda3/bin/vcfutils.pl line 265.
SRR9025118 Variant calling process has finished

LloydJonesIII commented 1 year ago

07-14-23 (12-3pm)

Spent three hours troubleshooting and trying different code designs to get the variant calling code to work
Ended the last run early as it was throwing errors
Current errors I'm running into include missing files and BAM files with little to no content
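A small check like the following could catch missing or near-empty outputs right after each step instead of several steps later. This is only a sketch; the function name and the size threshold are assumptions. Note that a BAM file containing only a header is not zero bytes, so the threshold may need to be raised for that case.

```python
from pathlib import Path


def missing_or_empty(paths, min_bytes=1):
    """Return the paths that do not exist as files or are smaller than min_bytes."""
    problems = []
    for raw in paths:
        path = Path(raw)
        if not path.is_file() or path.stat().st_size < min_bytes:
            problems.append(str(path))
    return problems
```

Usage: call it with the expected .sam, .bam, .bcf, and .vcf paths for a sample and stop the loop for that sample if the returned list is not empty.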

LloydJonesIII commented 1 year ago

07-18-23

Found a possible fix for the error I have been stuck on for a while. It was a text file formatting problem: invisible ^M characters (carriage returns) were attached to filenames, which caused file reading errors. They come from the text editor I am currently using, which ends each line of the .sh files with ^M plus a newline. All other commands read this fine, but the variant calling commands cannot interpret it properly.

LloydJonesIII commented 1 year ago

A possible solution is to add dos2unix to our VM so I can convert the files, or I can try completely rewriting my commands through the command line to see if that fixes the error.
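If installing dos2unix is not convenient, the same conversion can be sketched in a few lines of Python. The function name is made up for illustration; it rewrites the file in place, replacing each carriage return plus newline pair with a plain newline.

```python
from pathlib import Path


def to_unix_line_endings(path):
    """Rewrite a file in place with LF line endings.

    Returns the number of CRLF pairs that were replaced.
    """
    target = Path(path)
    data = target.read_bytes()
    fixed = data.replace(b"\r\n", b"\n")
    if fixed != data:
        target.write_bytes(fixed)
    return data.count(b"\r\n")
```

Running this on each .sh file and on any text file of accessions before the workflow starts should remove the stray ^M characters from the values the shell reads.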

LloydJonesIII commented 1 year ago

Retyping the function entirely through nano on the command line has solved this issue within the variant calling loop. Uploading all working files to the GitHub python hub directory.

LloydJonesIII commented 1 year ago

07-21-23

Found an issue with the multi-threaded trimmomatic step. It may be due to internet connectivity issues; I ran into a similar problem last semester doing VM work on my home internet. I will keep the process the same, as it seems to be working apart from crashing my home connection. Other steps are working as intended.

LloydJonesIII commented 1 year ago

07-24-23

Testing code reverted to a previous trimmomatic methodology
Full run-through attempt number 5, without multi-threaded trimmomatic, worked from start to finish with the smaller original project
Future testing will be to:
create a multithreaded trimmomatic step that works without freezing up the VM
rework the user input sections to allow for greater flexibility
run a larger project as a stress test to see where alterations are needed to support larger projects
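One possible direction for the first item, sketched under the assumption that the freezes come from too many instances running at once (this has not been confirmed): cap the number of commands that run simultaneously. The function name and the default limit are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_bounded(commands, max_workers=2):
    """Run each command (a list of arguments) with at most max_workers at once.

    Returns the exit codes in the same order as the input commands.
    """
    def run_one(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))
```

Each command here could be one trimmomatic shell script invocation; any non-zero exit code in the result marks a part that needs to be rerun.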

LloydJonesIII commented 1 year ago

07-25-23

to see limits: re-run with '-x' option.

============================================================= An error occurred during processing. A report was generated into the file '/home/intern4/ncbi_error_report.txt'. If the problem persists, you may consider sending the file to 'sra-tools@ncbi.nlm.nih.gov' for assistance.

fasterq-dump quit with error code 3
Archive: ../../media/volume/sdb/BigRun1/assembly/2242.zip
inflating: ../../media/volume/sdb/BigRun1/assembly/reference/README.md
inflating: ../../media/volume/sdb/BigRun1/assembly/reference/ncbi_dataset/data/assembly_data_report.jsonl
inflating: ../../media/volume/sdb/BigRun1/assembly/reference/ncbi_dataset/data/GCF_004799605.1/GCF_004799605.1_ASM479960v1_genomic.fna
inflating: ../../media/volume/sdb/BigRun1/assembly/reference/ncbi_dataset/data/GCF_004799605.1/protein.faa
inflating: ../../media/volume/sdb/BigRun1/assembly/reference/ncbi_dataset/data/GCF_004799605.1/cds_from_genomic.fna
inflating: ../../media/volume/sdb/BigRun1/assembly/reference/ncbi_dataset/data/GCF_004799605.1/genomic.gff
inflating: ../../media/volume/sdb/BigRun1/assembly/reference/ncbi_dataset/data/GCF_004799605.1/genomic.gtf
inflating: ../../media/volume/sdb/BigRun1/assembly/reference/ncbi_dataset/data/GCF_004799605.1/genomic.gbff
inflating: ../../media/volume/sdb/BigRun1/assembly/reference/ncbi_dataset/data/GCF_004799605.1/sequence_report.jsonl
inflating: ../../media/volume/sdb/BigRun1/assembly/reference/ncbi_dataset/data/dataset_catalog.json
Gzip process has begun

gzip: ../../media/volume/sdb/BigRun1/fastq/SRR19515813.fastq.gz: No space left on device

LloydJonesIII commented 1 year ago

Typically this error happens when you exhaust the storage set aside for your SRA workspace. Oftentimes, users are unaware that we are using a workspace to cache downloaded data, and by default this is in your $HOME directory, although it almost never is the best place (it's the only one we can set as default, though!).
Might be able to solve this?

LloydJonesIII commented 1 year ago

08-01-23

Found a solution to the fasterq-dump error
The program creates very large temporary table files in order to produce the fastq files faster
The problem occurred because I wasn't specifying a location for the temp files, so they were being generated in my user directory and not on the mounted storage
Using -t, I was able to specify where I wanted the temp files to be generated
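As a sketch of how the hub script might build this call: the function below assembles a fasterq-dump command with -O for the output directory and -t for the temp directory. The function name and the example paths are hypothetical; only the two flags come from fasterq-dump itself.

```python
def fasterq_dump_cmd(accession, out_dir, temp_dir):
    """Build the argument list for fasterq-dump.

    -O sets the output directory and -t sets the directory for temporary files,
    so both can point at the mounted storage instead of the home directory.
    """
    return ["fasterq-dump", accession, "-O", str(out_dir), "-t", str(temp_dir)]
```

Usage: pass the returned list to `subprocess.run(..., check=True)`, for example with `out_dir` and `temp_dir` both located under the mounted volume.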

LloydJonesIII commented 1 year ago

08-07-23

Got a working variant of the splitter function; it seems to be splitting files properly.