erhard-lab / price

Improved Ribo-seq enables identification of cryptic translation events

fastqfilter step hanging when running multiple alignments #12

Closed jflucier closed 4 years ago

jflucier commented 4 years ago

Hi Florian,

I am still having a problem at the fastqfilter step.

As you will see in the script below, I changed the alignment process so that all dependencies (Gedi, Reaper, R, bowtie, jdk, etc.) are copied locally to the compute node.

When I execute 2 jobs, the fastqfilter messages at the beginning of the run are printed as expected:

```
[reaper] check 0 errors, 0 reads truncated, 50391296 clean, 0 lint, 50391296 total
2020-04-28 14:52:07.274 INFO Command: gedi -e FastqFilter -D -ld ERX609887.readlengths.tsv -min 18 ERX609887.lane.clean
2020-04-28 14:53:09.195 INFO Discovering classes in classpath
2020-04-28 14:53:09.550 INFO Preparing simple class references
2020-04-28 14:53:09.829 INFO Gedi 1.0.2 (JAR) startup
2020-04-28 15:49:43.093 INFO Finished: gedi -e FastqFilter -D -ld ERX609887.readlengths.tsv -min 18 ERX609887.lane.clean
aligning using bowtie on rrna
```

When I start 15 jobs, the log output hangs after the reaper execution:

```
[reaper] check 0 errors, 0 reads truncated, 144442614 clean, 0 lint, 144442614 total
```

It has been stuck at this log line for more than an hour. When I run 2 jobs, this message appears within seconds to minutes.

When I look at the process on the compute node, it shows:

```
jflucier 25339 0.0 0.2 36534756 91492 ? Sl 09:09 0:05 java -Xmx30g -Xms2048m -cp /localscratch/jflucier.5320192.0/ERX609888/programs/gedi/gedi.jar:/localscratch/jflucier.5320192.0/ERX609888/programs/gedi/lib/.* executables.FastqFilter -D -ld ERX609888.readlengths.tsv -min 18 ERX609888.lane.clean
```

As you can see, everything points to the local drive /localscratch/jflucier.5320192.0/, and the command is executed in the /localscratch/jflucier.5320192.0 folder.
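For reference, here is a quick way to check whether the stuck process is blocked in I/O (a hedged sketch; PID 25339 is the FastqFilter process from the ps output above, and the snippet defaults to the current shell just so it is runnable):

```shell
# Check whether a PID sits in uninterruptible I/O wait (state "D"), the usual
# signature of an NFS stall. Substitute the hanging FastqFilter PID (25339).
pid=${1:-$$}
state=$(awk '{print $3}' /proc/$pid/stat)
echo "pid $pid state: $state"
# For a JVM, a thread dump pinpoints the blocked call:
#   jstack $pid          # or: kill -3 $pid (dump goes to the JVM's stdout)
```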

Do you have an idea of what is going wrong? Can you briefly describe what this step does? I am thinking of rewriting it. Does it only filter the fastq by read length and then plot graphs?

When I look at the FastqFilter code, it looks like it filters for read size >= 18 and changes each fastq header line to a counter that increments for each read. Is that all the code does (besides the R plotting)?

Following is my alignment script:

```bash
#!/bin/bash

export MUGQIC_INSTALL_HOME=/cvmfs/soft.mugqic/CentOS6
module use $MUGQIC_INSTALL_HOME/modulefiles

module add bowtie/1.1.2
module add mugqic/star/2.7.3a
module add bedtools/2.26.0
module add samtools/1.5
module add ascp/3.5.4
module add sra-toolkit/2.9.6
module add python/3.6.3
module add java/1.8.0_121
module add gcc/5.4.0 r/3.5.0
module rm perl/5.16.3

export SAMPLE_SHEET=""
export ANN_REF=""
export RRNA_REF=""
export STUDY_NAME=""
export OUTPATH=""
export SAMPLE_CIT_LIST=""
export BOWTIE_INDEX="/nfs3_ib/ip32-ib/home/xroucou_group/analysis/price/openprot_1.4/gtf_1.4"
export RUN_NAME=""

if [ -z "$SLURM_TMPDIR" ]
then
  mkdir -p /home/$USER/temp
  SLURM_TMPDIR="/home/$USER/temp"
  echo "\$SLURM_TMPDIR is empty. Will use $SLURM_TMPDIR as temp folder."
else
  echo "Temp folder is $SLURM_TMPDIR"
fi

mkdir -p $SLURM_TMPDIR/$RUN_NAME
cd $SLURM_TMPDIR/$RUN_NAME

echo "copying programs"
mkdir $SLURM_TMPDIR/${RUN_NAME}/programs
rsync -ah --progress /nfs3_ib/ip32-ib/home/xroucou_group/apps/r_lib_3.5 $SLURM_TMPDIR/${RUN_NAME}/programs/
rsync -ah --progress /nfs3_ib/ip32-ib/home/xroucou_group/apps/gedi/build/ $SLURM_TMPDIR/${RUN_NAME}/programs/gedi
rsync -ah --progress /nfs3_ib/ip32-ib/home/xroucou_group/apps/Price_1.0.2/reaper-15-065/ $SLURM_TMPDIR/${RUN_NAME}/programs/reaper-15-065
rsync -ah --progress /nfs3_ib/ip32-ib/home/xroucou_group/apps/sratoolkit.2.10.4-centos_linux64/ $SLURM_TMPDIR/${RUN_NAME}/programs/sratoolkit
rsync -ah --progress /nfs3_ib/ip32-ib/home/xroucou_group/apps/edirect/ $SLURM_TMPDIR/${RUN_NAME}/programs/edirect
rsync -ah --progress /nfs3_ib/ip32-ib/home/xroucou_group/apps/jdk/jdk1.8.0_161/ $SLURM_TMPDIR/${RUN_NAME}/programs/jdk1.8.0_161
rsync -ah --progress /nfs3_ib/ip32-ib/home/xroucou_group/apps/R $SLURM_TMPDIR/${RUN_NAME}/programs/

echo "copying genome"
mkdir $SLURM_TMPDIR/${RUN_NAME}/genome
rsync -ah --progress $BOWTIE_INDEX/${RRNA_REF}.* $SLURM_TMPDIR/${RUN_NAME}/genome/
rsync -ah --progress $BOWTIE_INDEX/${ANN_REF}.* $SLURM_TMPDIR/${RUN_NAME}/genome/

export R_LIBS_USER=$SLURM_TMPDIR/${RUN_NAME}/r_lib_3.5
export R_LIBS_USER=/nfs3_ib/ip32-ib/home/xroucou_group/apps/r_lib_3.5
export R_DATATABLE_NUM_PROCS_PERCENT=100
export PATH=/nfs3_ib/ip32-ib/home/xroucou_group/apps/miniconda3/bin:$PATH
export PATH=$SLURM_TMPDIR/${RUN_NAME}/programs/gedi:$PATH
export PATH=$SLURM_TMPDIR/${RUN_NAME}/programs/reaper-15-065/src:$PATH
export PATH=/cvmfs/soft.computecanada.ca/easybuild/software/2017/sse3/Compiler/intel2016.4/bowtie/1.1.2/bin:$PATH
export PATH=$SLURM_TMPDIR/${RUN_NAME}/programs/sratoolkit/bin:$PATH
export PATH=$SLURM_TMPDIR/${RUN_NAME}/programs/edirect:$PATH
export PATH=$SLURM_TMPDIR/${RUN_NAME}/programs/jdk1.8.0_161/bin:$PATH
export JAVA_HOME=$SLURM_TMPDIR/${RUN_NAME}/programs/jdk1.8.0_161
export ALTORF_SCRIPT=/nfs3_ib/ip32-ib/home/xroucou_group/apps/altorf/scripts
export PERL5LIB=/nfs3_ib/ip32-ib/home/xroucou_group/apps_centos7/perl5/lib/perl5
export NCBI_API_KEY=caf7771b1c9c667070db2f515f28addb5908

echo "fetch run info"
RUNINFO=$(esearch -db sra -query $RUN_NAME | efetch -format runinfo | awk 'NR>=2' | cut -d "," -f1)

#

echo "fetch fastq"
source /nfs3_ib/ip32-ib/home/xroucou_group/apps/miniconda3/etc/profile.d/conda.sh
conda activate price
mkdir $SLURM_TMPDIR/${RUN_NAME}/tmp_fastq
mkdir $SLURM_TMPDIR/${RUN_NAME}/fastq
for run in $RUNINFO
do
  echo "downloading $run"
  parallel-fastq-dump --sra-id $run --threads 23 --tmpdir $SLURM_TMPDIR/${RUN_NAME}/tmp_fastq --outdir $SLURM_TMPDIR/${RUN_NAME}/fastq --split-files
done
conda deactivate

#

cat $SLURM_TMPDIR/${RUN_NAME}/fastq/*.fastq > $RUN_NAME.fastq
fastq-dump -Z $RUNINFO > $RUN_NAME.fastq

if [ -f "${OUTPATH}/../fastq/${RUN_NAME}.fastq" ]
then
  echo "copying fastq ${RUN_NAME}.fastq to $SLURM_TMPDIR/${RUN_NAME}/"
  cp ${OUTPATH}/../fastq/${RUN_NAME}.fastq $SLURM_TMPDIR/${RUN_NAME}/
  rsync -ah --progress ${OUTPATH}/../fastq/${RUN_NAME}.fastq $SLURM_TMPDIR/${RUN_NAME}/
else
  echo "${OUTPATH}/../fastq/${RUN_NAME}.fastq does not exist. Please make sure it is available. Have you run the refetch script?"
  exit 2
fi

echo "Counting all reads"
echo -e "Category\tCount" > $RUN_NAME.reads.tsv
echo -ne "All\t" >> $RUN_NAME.reads.tsv
L=$(wc -l $RUN_NAME.fastq | cut -f1 -d' ')
leftreads=$((L / 4))
echo $leftreads >> $RUN_NAME.reads.tsv

echo "find adaptor sequence using minion"
head -n4000000 $RUN_NAME.fastq > $RUN_NAME.head.fastq
minion search-adapter -i $RUN_NAME.head.fastq -show 1 -write-fasta adapter.fasta
rm $RUN_NAME.head.fastq
ADAPT=$(head -n2 adapter.fasta | tail -n1)

echo "cleaning and filtering reads using reaper"
reaper --nozip --noqc -3p-prefix 1/1/0/0 -swp 1/4/4 -geom no-bc -i $RUN_NAME.fastq -basename $RUN_NAME -3pa $ADAPT

export PATH=/nfs3_ib/ip32-ib/home/xroucou_group/apps/R:$PATH
export PATH=$SLURM_TMPDIR/${RUN_NAME}/programs/R:$PATH
gedi -e FastqFilter -D -ld $RUN_NAME.readlengths.tsv -min 18 $RUN_NAME.lane.clean > $RUN_NAME.fastq
rm $RUN_NAME.lint $RUN_NAME.lane.clean

echo -ne "Trimmed\t" >> $RUN_NAME.reads.tsv
L=$(wc -l $RUN_NAME.fastq | cut -f1 -d' ')
leftreads=$((L / 4))
echo $leftreads >> $RUN_NAME.reads.tsv

echo "aligning using bowtie on rrna"
bowtie --threads 23 -a -m 100 -v 3 --best --strata --norc --un ${RUN_NAME}_unmapped.fastq --sam $SLURM_TMPDIR/${RUN_NAME}/genome/${RRNA_REF}.genomic $RUN_NAME.fastq /dev/null
mv ${RUN_NAME}_unmapped.fastq $RUN_NAME.fastq
echo -ne "rRNA removal\t" >> $RUN_NAME.reads.tsv
leftreads=$( grep -c @ $RUN_NAME.fastq )
echo $leftreads >> $RUN_NAME.reads.tsv

echo "genomic alignment using bowtie on reference"
bowtie --threads 23 -a -m 100 -v 3 --best --strata --sam $SLURM_TMPDIR/${RUN_NAME}/genome/${ANN_REF}.genomic $RUN_NAME.fastq $ANN_REF.Genomic.sam

echo "samtool sort on genomic alignment"
samtools sort -o $ANN_REF.Genomic.sort.sam -n -T $SLURM_TMPDIR/${RUN_NAME}/sort -@ 8 -m 1G $ANN_REF.Genomic.sam
rm $SLURM_TMPDIR/${RUN_NAME}/sort*.bam
mv $ANN_REF.Genomic.sort.sam $ANN_REF.Genomic.sam
unali=$( grep -v @ $ANN_REF.Genomic.sam | cut -f2 | grep -c "^4$" )
echo -ne "$ANN_REF(Genomic)\t" >> $RUN_NAME.reads.tsv
echo $((leftreads-unali)) >> $RUN_NAME.reads.tsv

echo "transcriptomic alignment using bowtie on reference"
bowtie --threads 23 -a -m 100 -v 3 --best --strata --norc --sam $SLURM_TMPDIR/${RUN_NAME}/genome/${ANN_REF}.transcriptomic $RUN_NAME.fastq ${ANN_REF}.Transcriptomic.sam

echo "samtool sort on transcriptomic alignment"
samtools sort -o ${ANN_REF}.Transcriptomic.sort.sam -n -T $SLURM_TMPDIR/${RUN_NAME}/sort -@ 8 -m 1G ${ANN_REF}.Transcriptomic.sam
rm $SLURM_TMPDIR/${RUN_NAME}/sort*.bam
mv ${ANN_REF}.Transcriptomic.sort.sam ${ANN_REF}.Transcriptomic.sam
unali=$( grep -v @ ${ANN_REF}.Transcriptomic.sam | cut -f2 | grep -c "^4$" )
echo -ne "${ANN_REF}(Transcriptomic)\t" >> $RUN_NAME.reads.tsv
echo $((leftreads-unali)) >> $RUN_NAME.reads.tsv

mkdir -p $OUTPATH/report
echo "gedi mergesam step"
gedi -t . -e MergeSam -D -genomic ${ANN_REF} -t $OUTPATH/scripts/${RUN_NAME}.prio.csv -prio $OUTPATH/scripts/${RUN_NAME}.prio.oml -chrM -o $RUN_NAME.cit
echo -ne "Merged\t" >> $RUN_NAME.reads.tsv

echo "gedi Nashorn step"
gedi Nashorn -e "println(EI.wrap(DynamicObject.parseJson(FileUtils.readAllText(new File('$RUN_NAME.cit.metadata.json'))).getEntry('conditions').asArray()).mapToDouble(function(d) d.getEntry('total').asDouble()).sum())" >> $RUN_NAME.reads.tsv

echo "copying results to $OUTPATH"
cp *.readlengths.* $OUTPATH/report
cp $RUN_NAME.reads.tsv $OUTPATH/report

if [ -f "$RUN_NAME.cit" ]
then
  cp $RUN_NAME.cit* $OUTPATH
  rm -rf $SLURM_TMPDIR/$RUN_NAME
else
  echo "There were some errors, will copy temp directory to $OUTPATH"
  cp -r * $OUTPATH
fi

echo "output R plots"
Rscript $OUTPATH/report/${RUN_NAME}.reads.R
echo '{"plots":[{"section":"Mapping statistics","id":"mapping'$RUN_NAME'","title":"'$RUN_NAME'","description":"How many reads are removed by adapter trimming and rRNA mapping, how many are mapped to which reference and retained overall.","img":"'$RUN_NAME'.reads.png","script":"'$RUN_NAME'.reads.R","csv":"'$RUN_NAME'.reads.tsv"}]}' > $RUN_NAME.reads.report.json
cp $RUN_NAME.reads.report.json $OUTPATH/report
echo "done pipeline for $RUN_NAME"
```

thanks for your help, JF

thanks for your help JF

jflucier commented 4 years ago

I did another check and when I run the mergesam command on multiple compute nodes, I have the exact same problem.

This command hangs forever:

```
gedi -t . -e MergeSam -D -genomic ${ANN_REF} -t $OUTPATH/scripts/${RUN_NAME}.prio.csv -prio $OUTPATH/scripts/${RUN_NAME}.prio.oml -chrM -o $RUN_NAME.cit
```

Again, this is run in the local scratch folder of the compute node.

So I think this problem is in the Gedi engine initialisation and not in the specific code of FastqFilter or MergeSam.

thanks JF

jflucier commented 4 years ago

Hi again Florian,

I think I have found the bug behind this issue... I replaced every occurrence in the gedi code where a call to the user home directory is made and added the NFS anchor to those calls.

I identified the changes to be made using the grep command below and edited each reference by prepending the string `"/nfs3_ib/ip32-ib"+`. Then I rebuilt the gedi project:

```
|11:14:07|jflucier@ip32-centos7:[openprot_1.4]> grep -re 'System.getProperty("user.home")' /nfs3_ib/ip32-ib/home/xroucou_group/apps/test/gedi
/nfs3_ib/ip32-ib/home/xroucou_group/apps/test/gedi/GediCore/src/gedi/app/Config.java: private static final String DEFAULT_FOLDER = "/nfs3_ib/ip32-ib"+System.getProperty("user.home")+"/.gedi";
/nfs3_ib/ip32-ib/home/xroucou_group/apps/test/gedi/GediCore/src/jline/internal/Configuration.java: return new File("/nfs3_ib/ip32-ib"+System.getProperty("user.home"));
/nfs3_ib/ip32-ib/home/xroucou_group/apps/test/gedi/build/src/gedi/app/config/StartCommandConfigurator.java: String lines = new LineOrientedFile("/nfs3_ib/ip32-ib"+System.getProperty("user.home")+"/.bashrc").readAllText();
/nfs3_ib/ip32-ib/home/xroucou_group/apps/test/gedi/build/src/gedi/app/config/StartCommandConfigurator.java: LineWriter writer = new LineOrientedFile("/nfs3_ib/ip32-ib"+System.getProperty("user.home")+"/.bashrc").write();
/nfs3_ib/ip32-ib/home/xroucou_group/apps/test/gedi/build/src/gedi/app/Config.java: private static final String DEFAULT_FOLDER = "/nfs3_ib/ip32-ib"+System.getProperty("user.home")+"/.gedi";
/nfs3_ib/ip32-ib/home/xroucou_group/apps/test/gedi/build/src/net/sf/cglib/TestAll.java: public static String DEFAULT_DEBUG_LOACATION = "/nfs3_ib/ip32-ib"+System.getProperty("user.home") +
/nfs3_ib/ip32-ib/home/xroucou_group/apps/test/gedi/build/src/jline/internal/Configuration.java: return new File("/nfs3_ib/ip32-ib"+System.getProperty("user.home"));
/nfs3_ib/ip32-ib/home/xroucou_group/apps/test/gedi/GediFx/src/gedi/app/config/StartCommandConfigurator.java: String lines = new LineOrientedFile("/nfs3_ib/ip32-ib"+System.getProperty("user.home")+"/.bashrc").readAllText();
/nfs3_ib/ip32-ib/home/xroucou_group/apps/test/gedi/GediFx/src/gedi/app/config/StartCommandConfigurator.java: LineWriter writer = new LineOrientedFile("/nfs3_ib/ip32-ib"+System.getProperty("user.home")+"/.bashrc").write();
/nfs3_ib/ip32-ib/home/xroucou_group/apps/test/gedi/cglib/src/net/sf/cglib/TestAll.java: public static String DEFAULT_DEBUG_LOACATION = "/nfs3_ib/ip32-ib"+System.getProperty("user.home") +
```
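For reference, the same edit could be scripted rather than done by hand. This is a hedged sketch assuming the source tree from the grep output above; the sed address skips lines that already carry the prefix, so it is safe to re-run:

```shell
#!/bin/bash
# Prepend the NFS anchor to every user.home lookup in the gedi sources.
# Pass a different source tree as $1 if yours lives elsewhere.
src=${1:-/nfs3_ib/ip32-ib/home/xroucou_group/apps/test/gedi}
grep -rl --include='*.java' 'System.getProperty("user.home")' "$src" |
while read -r f; do
  # Skip lines already patched, then prefix every remaining occurrence.
  sed -i '/"\/nfs3_ib\/ip32-ib"+System.getProperty("user.home")/!s|System.getProperty("user.home")|"/nfs3_ib/ip32-ib"+System.getProperty("user.home")|g' "$f"
done
```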

I have successfully run a test with 15 alignment jobs in parallel. Currently trying 100.

Still curious to know whether the fastqfilter step only filters out reads shorter than 18 nt and recodes the fastq headers with a number. If so, I found that the following awk command performs better:

```bash
echo "filtering fastq for size and recoding"
awk '
BEGIN { FS = "\t"; OFS = "\n"; count = 0 }
{
  count++
  header = $0
  getline seq; getline qheader; getline qseq
  if (length(seq) >= 18) { print "@" count, seq, "+", qseq }
}' < ${RUN_NAME}.lane.clean > ${RUN_NAME}.fastq
```
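As a quick sanity check of the awk filter above, here is a toy two-read run (throwaway file names): one 20 nt read that should pass and one 10 nt read that should be dropped.

```shell
# Toy input: read r1 (20 nt, kept) and r2 (10 nt, dropped).
printf '@r1\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n' > toy.lane.clean
awk 'BEGIN { OFS = "\n"; count = 0 }
     { count++; getline seq; getline qheader; getline qseq
       if (length(seq) >= 18) print "@" count, seq, "+", qseq }' toy.lane.clean > toy.fastq
cat toy.fastq   # only r1 survives, renumbered to @1
```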

thanks for your help,

JF

florianerhard commented 4 years ago

Dear JF,

sorry for not responding for the last few days.

I am not sure how this all connects. If you need to change the value of System.getProperty("user.home"), I suggest going with `java -Duser.home=...` instead of patching the code.

Regarding FastqFilter, you are right: at the moment it only does length filtering, computes length statistics, and renames the reads. There are certainly faster ways of doing that, but we would like to keep the flexibility of adding additional filters to this program without changing the pipeline code itself.

Best, Florian

jflucier commented 4 years ago

Hi again,

the 100-study test failed... again gedi is hanging at the fastqfilter step.

I used your suggested solution by adding -Duser.home= and -Duser.dir=, and I still have the same issue.

I changed to an interactive node where the user home folder is available on the compute nodes via /home/$USER, and now the pipeline doesn't hang even when I run more than 100 alignments.

I think the previous hangs of gedi happened because some variables (besides user.home and user.dir) that the gedi code accesses in /home/$USER are missing... it might be linked to logging, as the fastqfilter log messages don't appear...
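One way to pin down the remaining home-directory dependencies would be to trace file opens (a hedged sketch, assuming strace is installed on the node; `trace.log` is a throwaway file name):

```shell
# Log every file-open a command makes, then filter for home-directory
# accesses. `ls "$HOME"` is just a stand-in target; to inspect the already
# hanging gedi process instead, attach with:
#   strace -f -e trace=openat -p 25339 2>&1 | grep "$HOME"
strace -f -e trace=openat -o trace.log ls "$HOME"
grep "$HOME" trace.log
```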

Thank you for your help. Please tell me when you release a new version of gedi where all the gedi variables are read from a config file.

JF