esolares / HapSolo

Reduction of Althaps and Duplicate Contigs for Improved Hi-C Scaffolding
GNU General Public License v2.0

How to insert the output of wc -l to the array setting #1

Closed lyrk50 closed 4 years ago

lyrk50 commented 4 years ago

Hi, thank you for developing this software. I ran wc -l jobfile.txt and the output is 20876 jobfile.txt. I tried to edit sbatch_blat.sh to use #SBATCH -a 1-20876, but I get the error "sbatch: error: Batch job submission failed: Invalid job array specification". Could you give me some advice? Lynn

esolares commented 4 years ago

Hi Lynn,

I'm so sorry for the late reply; I did not see a GitHub notification for this issue. You are getting this error because the SLURM controller is configured with a maximum number of jobs per array. I recommend submitting 10,000 jobs at a time: first 1-10000, then 10001-20000, and then the last 876 jobs as 20001-20876 (see the sketch below). If 10,000 jobs is still too many, you can try halving that number and submitting 5,000 jobs at a time.
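As a concrete sketch: rather than editing the range inside sbatch_blat.sh three times, you can also pass the range on the command line with sbatch's -a/--array option, which overrides the value set in the script. The chunk boundaries below are just the example numbers from above:

# submit the 20876 tasks in chunks that stay under the controller's array-size limit
sbatch --array=1-10000 sbatch_blat.sh
sbatch --array=10001-20000 sbatch_blat.sh
sbatch --array=20001-20876 sbatch_blat.sh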

You might have to check whether any jobs fail in the process. You can do this by:

ls -lS slurmjobname* | head -n 25

This will sort the SLURM log files by size, largest first. Usually, if there are errors, it's because the job ran out of RAM. To make things easier I have created a branch in the project that allows using minimap2 instead of BLAT.
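If skimming the largest logs by hand gets tedious, one quick (illustrative) way to flag candidates is to grep the logs for error text; the exact wording of out-of-memory messages varies by cluster, so treat the patterns and the log prefix below as placeholders:

# list log files whose contents mention an error or an OOM kill (adjust prefix and patterns to your site)
grep -li "error\|out of memory\|oom" slurmjobname*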

Please let me know if you have any more issues.

esolares commented 4 years ago

Hi,

Are you still having issues? Please let me know. I will keep this ticket open for a few more days.

Thank you,

Edwin

wuxingbo1986 commented 4 years ago

Hi

I am having issues with sbatch_busco.sh, which also uses array job submission. I successfully divided my 10775 jobs into submissions of 1000 jobs each. However, it stopped at -a 10001-10775 and started showing "sbatch: error: Batch job submission failed: Invalid job array specification." It seems the job array cannot go beyond 10000. Any thoughts or solutions for this issue?

Thanks. Xingbo

esolares commented 4 years ago

Hi,

It sounds like the cluster you are working with will not accept array indices larger than that number. I recommend you split your jobfile into multiple jobfiles. You could do tail -n 775 jobfile.txt > jobfile2.txt and proceed that way; see the sketch below. Just please make sure to change the script so it reflects the new jobfile name, and use the range 1-775.
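A minimal sketch of that workflow, assuming your copy of the script reads the jobfile name from a JOBFILE variable as the provided sbatch scripts do (the second script name here is just illustrative):

# put the last 775 lines of the jobfile into a second jobfile
tail -n 775 jobfile.txt > jobfile2.txt

# copy the sbatch script, set JOBFILE="jobfile2.txt" inside the copy,
# then submit the remaining tasks with array indices 1-775
sbatch --array=1-775 sbatch_busco_part2.sh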

Thank you,

Edwin Solares, M.S.
PhD Candidate in Comparative Genomics and Evolutionary Biology
Department of Ecology and Evolutionary Biology
Gaut Lab, 5438 McGaugh Hall
University of California, Irvine
Irvine, CA 92697 USA


wuxingbo1986 commented 4 years ago

I guess that's the case. Thanks.

I also want to confirm with you that, when submitting sbatch_preprocess.sh, REF should be the new fasta file generated by preprocess.py?

Thanks. Xingbo

esolares commented 4 years ago

Hi,

Yes, that is correct. preprocess.py creates several fasta files within the contigs folder; these should each be used to run BUSCO on individually. The script also creates a new fasta file. The purpose of this is to remove any "illegal" characters in headers that can cause issues in downstream analysis. It also shortens each fasta header as much as possible without losing its uniqueness. The new fasta file created in the root of your work directory will have _new appended to its name. Please use it in all subsequent analyses, e.g. the all-by-all alignment. Depending on the size of your assembly, we recommend BLAT for assemblies <1Gb and minimap2 for assemblies >1Gb. If you are short on time you can just use minimap2, but it's possible that your purging step may not be as good.
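For illustration only (the exact minimap2 options are an assumption on my part and may differ from the ones used in the HapSolo minimap2 branch), a self-alignment of the _new assembly to produce the PAF that hapsolo.py takes via --paf could look roughly like this:

# all-by-all self-alignment; -x asm5 is an assembly-to-assembly preset, -D drops trivial
# self-diagonal hits and -P retains secondary chains (settings here are illustrative)
minimap2 -t 8 -x asm5 -DP assembly_new.fasta assembly_new.fasta > assembly_new_self_align.paf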

Thank you,

Edwin Solares, M.S.
PhD Candidate in Comparative Genomics and Evolutionary Biology
Department of Ecology and Evolutionary Biology
Gaut Lab, 5438 McGaugh Hall
University of California, Irvine
Irvine, CA 92697 USA


wuxingbo1986 commented 4 years ago

Hi Edwin,

I tried to run the following command and got the following error:

hapsolo.py -i hydrangea.arrow2.scaffolding_new.fasta --paf hydrangea.arrow2.scaffolding_new.fasta_self_align.paf -b ./contigs/busco/

Traceback (most recent call last):
  File "/lustre/project/gbru/gbru_hydrangea/xingbo/hapsolo/HapSolo/hapsolo.py", line 819, in <module>
    busco2contigdict, contigs2buscodict = importBuscos(buscofileloc)
  File "/lustre/project/gbru/gbru_hydrangea/xingbo/hapsolo/HapSolo/hapsolo.py", line 196, in importBuscos
    for line in open(mybuscofiles[0]):
IndexError: list index out of range

Any ideas?

Thanks. Xingbo

esolares commented 4 years ago

Hi

It looks like it's not finding any BUSCO output files. Could you run: ls ./contigs/busco/busco1/*/

Feel free to hide any information in the filenames that might compromise your study. There should be a full_table*.tsv file in that directory. If not, you can try: ls ./contigs/busco/busco*/*/full_table*.tsv

If that still doesn't show anything, then you will have to find out where your BUSCO output files are stored. If you don't find anything, which is my suspicion, could you share your BUSCO run script with me?
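If that glob comes back empty too, a broader (illustrative) search for BUSCO's full_table output anywhere under the contigs directory would be:

# recursively look for any BUSCO full_table file under ./contigs
find ./contigs -name "full_table*.tsv"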

Thank you,

Edwin


wuxingbo1986 commented 4 years ago

You are right, no tsv files were generated. The script is pasted below.

#!/bin/bash
#SBATCH -J buscocha                # jobname
#SBATCH -o buscocha.o%A.%a         # jobname.o%j for single (non-array) jobs, jobname.o%A.%a for array jobs
#SBATCH -e buscocha.e%A.%a         # error file name. A is the jobid and a is the arraytaskid
#SBATCH -a 100-775                 # start and stop of the array, start-end
#SBATCH -N 1
#SBATCH -n 20                      # -n, --ntasks=INT Maximum number of tasks. Use for requesting a whole node. env var SLURM_NTASKS
#SBATCH -c 1                       # -c, --cpus-per-task=INT The # of cpus/task. env var for threads is SLURM_CPUS_PER_TASK
#SBATCH -p short
#SBATCH -t 40:00:00                # run time (dd:hh:mm:ss) - 1.5 hours
#SBATCH --mail-type=end            # email me when the job finishes

# please load your BUSCO, Augustus, BRAKER binaries here
export PATH=/software/7/apps/augustus/3.2.3/bin:$PATH
export PATH=/software/7/apps/augustus/3.2.3/scripts:$PATH
export AUGUSTUS_CONFIG_PATH=/project/gbru/gbru_hydrangea/xingbo/BUSCO/AUGUSTUS_CONFIG/config/

INPUTTYPE="geno"

# please enter the directory for your ODB9 libraries here
MYLIBDIR="/lustre/project/reference/data/BUSCO/v4/lineages/"

# plantae
MYLIB="embryophyta_odb10"
SPTAG="arabidopsis"

OPTIONS="-l ${MYLIBDIR}${MYLIB} -sp ${SPTAG}"
JOBFILE="buscojobfile.txt"

mkdir -p busco
[ -d busco/busco${SLURM_ARRAY_TASK_ID} ] && rm -rf busco/busco${SLURM_ARRAY_TASK_ID}
mkdir -p busco/busco${SLURM_ARRAY_TASK_ID}
TMPDIR="./busco/busco${SLURM_ARRAY_TASK_ID}"
CWDIR=$(pwd)

SEED=$(head -n ${SLURM_ARRAY_TASK_ID} ${JOBFILE} | tail -n 1)
cd ${TMPDIR}

echo "Begin analysis on ${SEED}"

# removes escape chars and spaces. bug fix for mummer. mummer will not take escape characters and spaces in fasta headers
echo "Begin removing invalid characters in header on ${SEED}"
ln -sf ../../${SEED}
cat ${SEED} | sed -r 's/[/ =,\t|]+/_/g' | awk -F "_" '{ if (/^>/) {printf($1"_"$2"\n")} else {print $0} }' > $(basename ${SEED} .fasta)_new.fasta

QRY=$(basename ${SEED} .fasta)_new.fasta
QRY=${SEED}

echo "Begin quast analysis on ${QRY}"
quastrun="quast -t ${SLURM_CPUS_PER_TASK} ${QRY} -o quast_$(basename ${QRY} .fasta)"
echo $quastrun
$quastrun
echo "End quast analysis, cat results and begin busco run"
cat quast_$(basename ${QRY} .fasta)/report.txt > ${CWDIR}/$(basename ${QRY} .fasta)_scoresreport.txt
buscorun="BUSCO -c ${SLURM_CPUS_PER_TASK} -i ${QRY} -m ${INPUTTYPE} -o $(basename ${QRY} .fasta)_${MYLIB}_${SPTAG} ${OPTIONS} -t ./run_$(basename ${QRY} .fasta)_${MYLIB}_${SPTAG}/tmp"
echo $buscorun
$buscorun
echo "End busco run and cat results"
cat run_$(basename ${QRY} .fasta)_${MYLIB}_${SPTAG}/short*.txt >> ${CWDIR}/$(basename ${QRY} .fasta)_scoresreport.txt
cd ..

tar czf busco${SLURM_ARRAY_TASK_ID}.tar.gz busco${SLURM_ARRAY_TASK_ID}
rm -rf busco${SLURM_ARRAY_TASK_ID}

echo "Finished on ${QRY}"

Thanks. Xingbo

esolares commented 4 years ago

Hi

Thank you for sharing. I do not see the BUSCO binary directory being added to your PATH. My guess is that if you look at one of your error files, you will see that the busco binary is missing and there will be an error in stderr.

When you fix that, can you tell me what the busco binary is called? It should be BUSCO, but I just want to make sure.
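For illustration only (the install locations below are placeholders, not the real paths on your cluster), the exports would look like the Augustus ones already in your script:

# add the BUSCO and quast binaries to PATH before they are called (paths are hypothetical)
export PATH=/path/to/busco3/bin:$PATH
export PATH=/path/to/quast/bin:$PATH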

Thank you,

Edwin


wuxingbo1986 commented 4 years ago

Hi Edwin,

Thanks for pointing that out. I do realize that both the quast and busco paths were not added.

Now I have added them to PATH. The busco binary seems to have a problem recognizing your code: OPTIONS="-l ${MYLIBDIR}${MYLIB} -sp ${SPTAG}"

The error is: busco: error: unrecognized arguments: -sp ${SPTAG} -t ./run_tig00000000_embryophyta_odb10_arabidopsis/tmp

cat: run_tig00000000_embryophyta_odb10_arabidopsis/short*.txt: No such file or directory

I suppose I can take out the ${SPTAG} part, but how do I deal with the -t argument?

Thanks. Xingbo

esolares commented 4 years ago

Hi

Are you not choosing a species? At the top of the script you should choose a species for Augustus; if you don't, it will just choose one for you. Also, -t sets a temporary directory, which helps avoid issues with overwriting other runs.

Also, what version of BUSCO are you running? It should be a version supported by HapSolo; BUSCO 4 is currently not supported.

Thank you,

Edwin


wuxingbo1986 commented 4 years ago

Edwin,

I just switched to busco3 (I was using busco4). Then the following error popped up:

File "/software/7/apps/busco3/3.1.0/scripts/run_BUSCO.py", line 26, in <module>
    from pipebricks.PipeLogger import PipeLogger
ModuleNotFoundError: No module named 'pipebricks'

I am using python3 by default, should I switch to python2?

Seems like you spot every problem I have got. Thanks.

esolares commented 4 years ago

Sounds like your version of busco isn't fully installed. I would try running it on only one contig first, by changing the value of #SBATCH -a to #SBATCH -a 1.

The python versions shouldn't matter. I recommend installing busco3 via conda; it should run fine with python3. I think you are just missing python modules.

Here is a link on how to install conda3 https://bioconda.github.io/user/install.html

and instructions here; make sure you go to your shared binaries folder first.

curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh

After the install you will be shown an rc file that you can source. I do not recommend having the installer edit your default login shell file.

Once that is done, please set up bioconda. Source your conda3 install first; to check that it worked, after sourcing the conda rc file your prompt should start with: (base) $ If everything looks good, then run the following:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Note: if you already have conda3 installed, I recommend you re-add the channels in this order, as the recommended order was updated recently.

First create an environment:

conda create --name busco3
conda activate busco3
conda install -c bioconda busco=3.0.2
conda install -c bioconda quast

Make sure the already-installed packages are not updated when you run the last line for quast. They should not be, but check just in case.

Now you can source your conda rc file and activate the busco3 environment in your sbatch script (see the sketch below). Please let me know if you still have issues. Thank you for raising this issue. I will update the readme so that users install miniconda3.
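A minimal sketch of the activation step inside the sbatch script, assuming a standard miniconda3 install (adjust the path to wherever you installed it):

# make conda usable in the non-interactive batch shell, then activate the busco3 environment
source /path/to/miniconda3/etc/profile.d/conda.sh
conda activate busco3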


wuxingbo1986 commented 4 years ago

Edwin,

The program is running now, but I have another issue. I got busco results like:

"C:0.0%[S:0.0%,D:0.0%],F:0.0%,M:100.0%,n:1440" and ask me to check my augustus.log for the problem, which showed:

Warning: Block unknown_F is not significant enough, removed from profile.
Warning: Block unknown_G is not significant enough, removed from profile.
Warning: Block unknown_K is not significant enough, removed from profile.
Will create parameters for a EUKARYOTIC species!
creating directory /home/xxxx/AUGUSTUS_CONFIG/config/species/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388/ ...
creating /home/xxx/AUGUSTUS_CONFIG/config/species/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388_parameters.cfg ...
creating /home/xxxx/AUGUSTUS_CONFIG/config/species/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388_weightmatrix.txt ...
creating /home/xxxx/AUGUSTUS_CONFIG/config/species/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388_metapars.cfg ...
The necessary files for training BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388 have been created.
Now, either run etraining or optimize_parameters.pl with --species=BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388.
etraining quickly estimates the parameters from a file with training genes.
optimize_augustus.pl alternates running etraining and augustus to find optimal metaparameters.

/software/7/apps/augustus/3.2.3/bin/etraining: ERROR Input file not in genbank format.

Warning: Block unknown_F is not significant enough, removed from profile.

/software/7/apps/augustus/3.2.3/bin/augustus: ERROR ExonModel: Couldn't open file /home/xxx/AUGUSTUS_CONFIG/config/species/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388_exon_probs.pbl

/software/7/apps/augustus/3.2.3/bin/augustus: ERROR ExonModel: Couldn't open file /home/xxxx/AUGUSTUS_CONFIG/config/species/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388_exon_probs.pbl

/software/7/apps/augustus/3.2.3/bin/augustus: ERROR ExonModel: Couldn't open file /home/xxxx/AUGUSTUS_CONFIG/config/species/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388_exon_probs.pbl

Warning: Block unknown_G is not significant enough, removed from profile. Warning: Block unknown_K is not significant enough, removed from profile.

/software/7/apps/augustus/3.2.3/bin/augustus: ERROR ExonModel: Couldn't open file /home/xxxx/AUGUSTUS_CONFIG/config/species/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388_exon_probs.pbl

/software/7/apps/augustus/3.2.3/bin/augustus: ERROR ExonModel: Couldn't open file /home/xxxx/AUGUSTUS_CONFIG/config/species/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388/BUSCO_tig00036361_embryophyta_odb9_arabidopsis_161133388_exon_probs.pbl