GATB / DiscoSnp

DiscoSnp is designed for discovering all kinds of SNPs (not only isolated ones), as well as insertions and deletions, from raw set(s) of reads.
https://gatb.inria.fr/software/discosnp/
GNU Affero General Public License v3.0
38 stars 20 forks source link

New release needed #13

Closed coolkom closed 3 years ago

coolkom commented 4 years ago

Dear Pierre,

I've noticed you made some updates in the GitHb version (especially with DiscoSnpRad) but no new release has been done (to use with conda for example, @lecorguille). Would it be possible to make one please? Thanks!

Best, Komlan.

pierrepeterlongo commented 4 years ago

Dear Komlan, You're perfectly right. This is on my todo list. Should be done this week or the next one Best, Pierre

coolkom commented 4 years ago

Hello Pierre, Thanks for replying. I have tested DiscoSnpRad and noticed a couple of bugs I can report here so that you can include modifs in the next release:

1- In the "discoRAD_clustering.sh" file, in order to run the script in case one used the released version or a built one, please replace from line 77, what is between :

Detect the directory path and #################### PARAMETERS VALUES #######################

by

Detect the directory path EDIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd ) if [ -d "$EDIR/../../build/" ] ; then # VERSION SOURCE COMPILED BINDIR=$EDIR"/../../build/bin" else # VERSION BINARY BINDIR=$EDIR"/../../bin" fi rawdiscofile_base=$( basename "${rawdiscofile}" .fa) #################### PARAMETERS VALUES #######################

2- Please comment lines 208 and 209 so that one can use ${disco_simpler}* and ${disco_filtered}.fa files later

3- In the "run_discoSnpRad.sh" script file, on line 527, replace : " kissnp2Cmd="${kissnp2_bin} -in $h5prefix.h5 -out ${kissprefix}_r -b $b $l $x -P $P -D $D $extend $option_cores_gatb $output_coverage_option -coverage_file ${h5prefix}_cov.h5 -max_ambigous_indel ${max_ambigous_indel} -max_symmetrical_crossroads ${option_max_symmetrical_crossroads} -verbose $verbose -max_truncated_path_length_difference ${max_truncated_path_length_difference}" "

by

" kissnp2Cmd="${kissnp2_bin} -in $h5prefix.h5 -out ${kissprefix}_r -b $b $l $x -P $P -D $D $extend $option_cores_gatb $output_coverage_option -coverage_file ${h5prefix}_cov.h5 -max_ambigous_indel ${max_ambigous_indel} -max_symmetrical_crossroads ${option_max_symmetrical_crossroads} -verbose $verbose" "

Since the flag "-max_truncated_path_length_difference" is not in use, this halts the run.

4- Since the -G flag is already (partially) implemented in the run_discoSnpRad.sh" script file, but not in use, I suggest you simply add the following at the end of the file, to create a second vcf file that will use the reference genome. This way, we get a "denovo" & "refmap" vcf files:


####################################################################### #################### Deal with ref mapped VCF ############################### #######################################################################

T="$(date +%s)" echo -e "$yellow ###############################################################" echo -e " #################### CREATE VCF with REF MAP #######################" echo -e " #################################################################### $reset"

if [ -n "$genome" ]; then # A Reference genome is provided vcf_ref_output="${kissprefix}_refmap.vcf" vcfCreatorCmd="$EDIR/../scripts/run_VCF_creator.sh -G $genome -p ${kissprefix}_raw_filtered.fa -o $vcf_ref_output -I $option_cores_post_analysis $e" echo $green$vcfCreatorCmd$cyan if [[ "$wraith" == "false" ]]; then $vcfCreatorCmd fi

if [ $? -ne 0 ]
then
    echo "$red there was a problem with VCF creation. See how to use the \"run_VCF_creator.sh\" alone.$reset"
    exit 1
fi

fi echo $reset T="$(($(date +%s)-T))" if [[ "$wraith" == "false" ]]; then echo -e "${yellow}\t###############################################################" echo -e "\t#################### DISCOSNPRAD FINISHED ######################" echo -e "\t###############################################################" Ttot="$(($(date +%s)-Ttot))" echo "DiscoSnpRad total time in seconds: ${Ttot}" echo -e "\t################################################################################################################" echo -e "\t fasta of predicted variant is \""${kissprefix}_raw.fa"\"" echo -e "\t VCF file for denovo analysis (1-based) is \""${final_output}"\"" echo -e "\t An IGV ready VCF file (sorted by position, only mapped variants, 0-based) is \""${kissprefix}_refmap_for_IGV.vcf"\"" echo -e "\t Thanks for using discoSnpRad - http://colibread.inria.fr/discoSnp/ - Forum: http://www.biostars.org/t/discoSnp/" echo -e "\t################################################################################################################${reset}" fi


Of course, I'm just suggesting, you are in a better position to decide how to implement this best ;-)

Thanks !

Best, Komlan.

pierrepeterlongo commented 4 years ago

Thanks for this suggestions

1- In the "discoRAD_clustering.sh" file, in order to run the script in case one used the released version or a built one, please replace from line 77, what is between :

Detect the directory path and #################### PARAMETERS VALUES #######################

by

Detect the directory path EDIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd ) if [ -d "$EDIR/../../build/" ] ; then # VERSION SOURCE COMPILED BINDIR=$EDIR"/../../build/bin" else # VERSION BINARY BINDIR=$EDIR"/../../bin" fi rawdiscofile_base=$( basename "${rawdiscofile}" .fa) #################### PARAMETERS VALUES ####################### This was missing, thanks a lot. 2- Please comment lines 208 and 209 so that one can use ${disco_simpler}* and ${disco_filtered}.fa files later

I prefer to remove disco_simpler which is useless. I now keep ${disco_filtered}.fa

3- In the "run_discoSnpRad.sh" script file, on line 527, replace : " kissnp2Cmd="${kissnp2_bin} -in $h5prefix.h5 -out ${kissprefix}_r -b $b $l $x -P $P -D $D $extend $option_cores_gatb $output_coverage_option -coverage_file ${h5prefix}_cov.h5 -max_ambigous_indel ${max_ambigous_indel} -max_symmetrical_crossroads ${option_max_symmetrical_crossroads} -verbose $verbose -max_truncated_path_length_difference ${max_truncated_path_length_difference}" "

by

" kissnp2Cmd="${kissnp2_bin} -in $h5prefix.h5 -out ${kissprefix}_r -b $b $l $x -P $P -D $D $extend $option_cores_gatb $output_coverage_option -coverage_file ${h5prefix}_cov.h5 -max_ambigous_indel ${max_ambigous_indel} -max_symmetrical_crossroads ${option_max_symmetrical_crossroads} -verbose $verbose" "

Since the flag "-max_truncated_path_length_difference" is not in use, this halts the run.

I do not understand this point, option max_truncated_path_length_difference is in use in kissplice and it's working.

4- Since the -G flag is already (partially) implemented in the run_discoSnpRad.sh" script file, but not in use, I suggest you simply add the following at the end of the file, to create a second vcf file that will use the reference genome. This way, we get a "denovo" & "refmap" vcf files:

####################################################################### #################### Deal with ref mapped VCF ############################### #######################################################################

T="$(date +%s)" echo -e "$yellow ###############################################################" echo -e " #################### CREATE VCF with REF MAP #######################" echo -e " #################################################################### $reset"

if [ -n "$genome" ]; then # A Reference genome is provided vcf_ref_output="${kissprefix}_refmap.vcf" vcfCreatorCmd="$EDIR/../scripts/run_VCF_creator.sh -G $genome -p ${kissprefix}_raw_filtered.fa -o $vcf_ref_output -I $option_cores_post_analysis $e" echo $green$vcfCreatorCmd$cyan if [[ "$wraith" == "false" ]]; then $vcfCreatorCmd fi

if [ $? -ne 0 ] then echo "$red there was a problem with VCF creation. See how to use the \"run_VCF_creator.sh\" alone.$reset" exit 1 fi

fi echo $reset T="$(($(date +%s)-T))" if [[ "$wraith" == "false" ]]; then echo -e "${yellow}\t###############################################################" echo -e "\t#################### DISCOSNPRAD FINISHED ######################" echo -e "\t###############################################################" Ttot="$(($(date +%s)-Ttot))" echo "DiscoSnpRad total time in seconds: ${Ttot}" echo -e "\t################################################################################################################" echo -e "\t fasta of predicted variant is ""${kissprefix}_raw.fa""" echo -e "\t VCF file for denovo analysis (1-based) is ""${final_output}""" echo -e "\t An IGV ready VCF file (sorted by position, only mapped variants, 0-based) is ""${kissprefix}_refmap_for_IGV.vcf""" echo -e "\t Thanks for using discoSnpRad - http://colibread.inria.fr/discoSnp/ - Forum: http://www.biostars.org/t/discoSnp/" echo -e "\t################################################################################################################${reset}" fi

This is a very nice suggestion. Before to publicly release it, we still have to perform more tests and validations.

coolkom commented 4 years ago

3- In the "run_discoSnpRad.sh" script file, on line 527, replace : " kissnp2Cmd="${kissnp2_bin} -in $h5prefix.h5 -out ${kissprefix}_r -b $b $l $x -P $P -D $D $extend $option_cores_gatb $output_coverage_option -coverage_file ${h5prefix}_cov.h5 -max_ambigous_indel ${max_ambigous_indel} -max_symmetrical_crossroads ${option_max_symmetrical_crossroads} -verbose $verbose -max_truncated_path_length_difference ${max_truncated_path_length_difference}" " by " kissnp2Cmd="${kissnp2_bin} -in $h5prefix.h5 -out ${kissprefix}_r -b $b $l $x -P $P -D $D $extend $option_cores_gatb $output_coverage_option -coverage_file ${h5prefix}_cov.h5 -max_ambigous_indel ${max_ambigous_indel} -max_symmetrical_crossroads ${option_max_symmetrical_crossroads} -verbose $verbose" " Since the flag "-max_truncated_path_length_difference" is not in use, this halts the run.

I do not understand this point, option max_truncated_path_length_difference is in use in kissplice and it's working.

**I think the issue might be in the kissnp2 binary. There, the max_truncated_path_length_difference flag seems not to work. Because my runs fail there and work only when I remove the flag. Maybe you could have a look, I might be missing something...

Thanks !

Best, Komlan.**

coolkom commented 4 years ago

Dear Pierre, Here is another suggestion:

And in the "create_IGV_compatible_VCF.sh" file, on line 32: cat $vcffile|grep -v "#"|sort -k 1,1 -k 2,2n |grep -v "^SNP"|grep -v "^INDEL">>$igvfile # from 2 2 6

In some cases (e.g. large vcf files), the sorting gives an error: "sort: write failed: /tmp/sortHNCRWX: No space left on device"

The default /tmp folder where the temporary sorted files are written might be full. Would it be possible to implement the "-T" flag for "sort" so that the temporary files are saved to the working directory instead (where the created vcf fiiles are saved) since there is more chance that this folder has more space to cope with the situation?

Thanks!

Best, Komlan.

pierrepeterlongo commented 4 years ago

Hi Komlan,

quick update: we are struggling with the CI machine, delaying the creation of the new release. It should be done quickly. Best, Pierre

pierrepeterlongo commented 4 years ago

Hi Komlan,

New release is finally out. Pierre

coolkom commented 4 years ago

Hi Pierre,

Great, thanks ! Best, Komlan.