bcgsc / RAILS

🚝RAILS and 👞🔨Cobbler: Assembly Improvement by Long Sequence Scaffolding/Gap-filling
GNU General Public License v3.0

How to run RAILS #24

Closed ardy20 closed 2 years ago

ardy20 commented 3 years ago

Dear Developer

Could you please provide a better example of how to run RAILS? I installed the tool, but the following command gives an error:

RAILS> runRAILSminimap.sh MJ_hifiasm_assembly.fa mj.combined.fa 90 0.95 pacbio /sw/Modules/QFAB 24

Usage: runRAILSminimap.sh <FASTA assembly .fa> <FASTA long sequences .fa> <anchoring sequence length eg. 250> <min sequence identity 0.95> <max. softclip eg. 250bp> <min. number of read support eg. 2> <long read type eg.: ont, pacbio, nil>

What might the problem be? Installation? I also installed minimap2 in the RAILS main folder and added its path both in the runRAILSminimap.sh script and to my $PATH. I also added the path to samtools.

Could you please clarify how to run it, e.g. with an example command?

Regards Ardy

warrenlr commented 3 years ago

it would be informative to see the error you are getting (you could post it here).

A priori, it looks like your command isn't supplying all the required arguments, which is why I think it fails, but difficult to know for sure without seeing your error output.
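A check like this can be scripted before launching a long run. The sketch below is hypothetical and not part of RAILS: `is_number` and `check_rails_args` are made-up names, and the only assumption is that arguments 3 to 6 must be numeric, as the usage message quoted above suggests.

```shell
# Hypothetical pre-flight check; argument positions follow the usage message above.
is_number() {
    # Accept plain non-negative integers or decimals such as 90 or 0.95.
    printf '%s\n' "$1" | grep -Eq '^[0-9]+(\.[0-9]+)?$'
}

check_rails_args() {
    if [ "$#" -lt 7 ]; then
        echo "expected at least 7 arguments, got $#" >&2
        return 1
    fi
    # Arguments 3-6: anchoring length, min identity, max softclip, min read support.
    pos=3
    for value in "$3" "$4" "$5" "$6"; do
        if ! is_number "$value"; then
            echo "argument $pos should be numeric, got '$value'" >&2
            return 1
        fi
        pos=$((pos + 1))
    done
}

# The first command in this thread puts "pacbio" where the max. softclip belongs:
check_rails_args MJ_hifiasm_assembly.fa mj.combined.fa 90 0.95 pacbio /sw/Modules/QFAB 24 \
    || echo "argument check failed"
# stderr: argument 5 should be numeric, got 'pacbio'
# stdout: argument check failed
```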

Examples are provided in the ./test folder. I invite you to consult the readme, and take a look at "runmeHuman.sh" to see how to run the command on a large genome. That example run uses minimap2.

so, to recap, you should be able to run RAILS out of the box by following these steps:

  1. Get RAILS from github (git clone https://github.com/bcgsc/RAILS)
  2. Go to test folder (cd ./RAILS/versions/RAILS_v1.5.1/test)
  3. Run the example (nohup ./runmeHuman.sh &)

this should work provided you have both minimap2 and samtools in your path. Note, you may have to replace the full path to samtools in the runmeHuman.sh with the path on your system. I just tested it and confirm it works with samtools (v1.10) and minimap2 (v2.20-r1061)
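A quick way to confirm that both tools are visible before starting is sketched below. `missing_tools` is a hypothetical helper, not something shipped with RAILS; it relies only on the standard `command -v` shell builtin.

```shell
# Hypothetical helper (not part of RAILS): print every named tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the tool.
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

missing=$(missing_tools minimap2 samtools)
if [ -n "$missing" ]; then
    echo "not found in PATH: $missing" >&2
else
    echo "minimap2: $(command -v minimap2)"
    echo "samtools: $(command -v samtools)"
fi
```

If samtools is reported as missing, that is the case where the full path inside runmeHuman.sh needs to be edited for your system.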

ardy20 commented 3 years ago

Dear Rene

Sorry, this is the complete log output for my run. The problem occurs at line 69:

(base) user@fl020:.../project/qaafi-cnafs/RAILS> runRAILSminimap.sh 3linear.MJ_hifiasm.fa 3linear.mj.combined.fa 90 0.95 250 2 pacbio /sw/Modules/QFAB/samtools 24
Resolving ambiguous bases -Ns- in 3linear.MJ_hifiasm.fa assembly using long sequences 3linear.mj.combined.fa
reformatting file 3linear.MJ_hifiasm.fa
WARNING: MAKE SURE YOUR INPUT FASTA IS ONE SEQUENCE PER LINE WITH NO LINE BREAKS!
reformatting file 3linear.mj.combined.fa
Aligning long sequences 3linear.mj.combined.fa-formatted.fa to your contigs..
Running minimap2 with preset map-pb
[M::mm_idx_gen::16.7011.63] collected minimizers
[M::mm_idx_gen::17.7472.32] sorted minimizers
[M::main::17.7472.32] loaded/built the index for 970 target sequence(s)
[M::mm_mapopt_update::18.5362.26] mid_occ = 343
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 970
[M::mm_idx_stat::18.9832.23] distinct minimizers: 43775270 (60.17% are singletons); average occurrences: 2.637; average spacing: 7.864; total length: 907726117
[M::worker_pipeline::106.88619.20] mapped 37688 sequences
[M::worker_pipeline::164.33520.71] mapped 37595 sequences
[M::worker_pipeline::224.47621.42] mapped 37637 sequences
[M::worker_pipeline::282.98221.83] mapped 37696 sequences
[M::worker_pipeline::342.80122.10] mapped 37620 sequences
[M::worker_pipeline::403.47422.32] mapped 37564 sequences
[M::worker_pipeline::463.65022.47] mapped 37534 sequences
[M::worker_pipeline::523.05222.58] mapped 37620 sequences
[M::worker_pipeline::584.36422.65] mapped 37630 sequences
[M::worker_pipeline::646.00222.71] mapped 37589 sequences
[M::worker_pipeline::704.10622.77] mapped 37656 sequences
[M::worker_pipeline::765.47522.83] mapped 37651 sequences
[M::worker_pipeline::826.23322.86] mapped 37520 sequences
[M::worker_pipeline::884.82422.90] mapped 37613 sequences
[M::worker_pipeline::947.38622.94] mapped 37676 sequences
[M::worker_pipeline::1005.96222.97] mapped 37659 sequences
[M::worker_pipeline::1065.95722.98] mapped 37384 sequences
[M::worker_pipeline::1128.15923.00] mapped 35716 sequences
[M::worker_pipeline::1193.45023.02] mapped 35756 sequences
[M::worker_pipeline::1252.56023.04] mapped 35688 sequences
[M::worker_pipeline::1316.78523.06] mapped 35703 sequences
[M::worker_pipeline::1378.13923.08] mapped 35685 sequences
[M::worker_pipeline::1438.24423.09] mapped 35664 sequences
[M::worker_pipeline::1500.74323.11] mapped 35752 sequences
[M::worker_pipeline::1564.62523.13] mapped 35780 sequences
[M::worker_pipeline::1624.25923.14] mapped 35795 sequences
[M::worker_pipeline::1687.08423.15] mapped 35727 sequences
[M::worker_pipeline::1749.96923.17] mapped 35676 sequences
[M::worker_pipeline::1810.22523.17] mapped 35682 sequences
[M::worker_pipeline::1872.50323.18] mapped 35706 sequences
[M::worker_pipeline::1934.31423.19] mapped 35630 sequences
[M::worker_pipeline::1995.89623.20] mapped 35708 sequences
[M::worker_pipeline::2059.57723.20] mapped 35689 sequences
[M::worker_pipeline::2122.31423.21] mapped 35727 sequences
[M::worker_pipeline::2183.37723.22] mapped 35668 sequences
[M::worker_pipeline::2245.25923.23] mapped 35705 sequences
[M::worker_pipeline::2306.36823.23] mapped 35693 sequences
[M::worker_pipeline::2368.04423.24] mapped 35676 sequences
[M::worker_pipeline::2431.45723.25] mapped 35622 sequences
[M::worker_pipeline::2492.39923.25] mapped 35701 sequences
[M::worker_pipeline::2555.29923.25] mapped 35687 sequences
[M::worker_pipeline::2617.32423.25] mapped 35716 sequences
[M::worker_pipeline::2680.33823.26] mapped 35716 sequences
[M::worker_pipeline::2743.25923.27] mapped 35745 sequences
[M::worker_pipeline::2800.69323.15] mapped 35697 sequences
[M::worker_pipeline::2802.84523.13] mapped 3352 sequences
[M::main] Version: 2.21-r1071
[M::main] CMD: minimap2 -x map-pb -I50g -N 10 -a -t 24 3linear.MJ_hifiasm.fa-formatted.fa 3linear.mj.combined.fa-formatted.fa
[M::main] Real time: 2803.073 sec; CPU: 64828.903 sec; Peak RSS: 16.554 GB
Gap-filling 3linear.MJ_hifiasm.fa-formatted.fa using 3linear.mj.combined.fa-formatted.fa
Running cobbler.pl -f 3linear.MJ_hifiasm.fa -s 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_gapfilling.fof -l 2 -g 250 -d 90 -i 0.95 -b 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_90ed.fof -p /sw/Modules/QFAB/samtools ...
/scratch/project/qaafi-cnafs/RAILS/bin/runRAILSminimap.sh: line 41: ./cobbler.pl: No such file or directory
Process terminated.
RAILS scaffolding 3linear.MJ_hifiasm.fa.gapsFill.fa sequences and gap-filling using long seqs 3linear.mj.combined.fa -- anchoring sequence threshold 90 bp
reformatting file 3linear.MJ_hifiasm.fa.gapsFill.fa
cat: 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_90_0.95_gapsFill.fa: No such file or directory
Aligning long sequences 3linear.mj.combined.fa-formatted.fa to your contigs..
Running minimap2 with preset map-pb
[M::mm_idx_gen::0.0012.29] collected minimizers
[M::mm_idx_gen::0.00310.14] sorted minimizers
[M::main::0.0039.96] loaded/built the index for 0 target sequence(s)
[M::mm_mapopt_update::0.0039.81] mid_occ = 10
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 1; #seq: 0
[M::mm_idx_stat::0.0039.66] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan; total length: 0
[M::worker_pipeline::20.6241.29] mapped 37688 sequences
[M::worker_pipeline::39.3751.00] mapped 37595 sequences
[M::worker_pipeline::58.2110.90] mapped 37637 sequences
[M::worker_pipeline::77.0140.85] mapped 37696 sequences
[M::worker_pipeline::95.7270.82] mapped 37620 sequences
[M::worker_pipeline::114.4950.80] mapped 37564 sequences
[M::worker_pipeline::133.2170.78] mapped 37534 sequences
[M::worker_pipeline::152.0480.77] mapped 37620 sequences
[M::worker_pipeline::170.8020.76] mapped 37630 sequences
[M::worker_pipeline::189.5960.75] mapped 37589 sequences
[M::worker_pipeline::208.3400.75] mapped 37656 sequences
[M::worker_pipeline::227.2130.74] mapped 37651 sequences
[M::worker_pipeline::245.9820.74] mapped 37520 sequences
[M::worker_pipeline::264.8220.73] mapped 37613 sequences
[M::worker_pipeline::283.7130.73] mapped 37676 sequences
[M::worker_pipeline::302.5410.73] mapped 37659 sequences
[M::worker_pipeline::321.3210.72] mapped 37384 sequences
[M::worker_pipeline::340.3390.72] mapped 35716 sequences
[M::worker_pipeline::359.2190.72] mapped 35756 sequences
[M::worker_pipeline::378.3040.72] mapped 35688 sequences
[M::worker_pipeline::397.0800.72] mapped 35703 sequences
[M::worker_pipeline::416.0190.72] mapped 35685 sequences
[M::worker_pipeline::434.8430.71] mapped 35664 sequences
[M::worker_pipeline::453.7180.71] mapped 35752 sequences
[M::worker_pipeline::472.5860.71] mapped 35780 sequences
[M::worker_pipeline::491.9050.71] mapped 35795 sequences
[M::worker_pipeline::510.7010.71] mapped 35727 sequences
[M::worker_pipeline::529.8210.71] mapped 35676 sequences
[M::worker_pipeline::548.6900.70] mapped 35682 sequences
[M::worker_pipeline::567.8070.70] mapped 35706 sequences
[M::worker_pipeline::586.6080.70] mapped 35630 sequences
[M::worker_pipeline::605.6650.70] mapped 35708 sequences
[M::worker_pipeline::624.5020.70] mapped 35689 sequences
[M::worker_pipeline::643.5230.70] mapped 35727 sequences
[M::worker_pipeline::662.3810.70] mapped 35668 sequences
[M::worker_pipeline::681.2540.70] mapped 35705 sequences
[M::worker_pipeline::700.1100.70] mapped 35693 sequences
[M::worker_pipeline::718.9120.70] mapped 35676 sequences
[M::worker_pipeline::737.9010.70] mapped 35622 sequences
[M::worker_pipeline::756.6920.70] mapped 35701 sequences
[M::worker_pipeline::775.4980.70] mapped 35687 sequences
[M::worker_pipeline::794.3610.70] mapped 35716 sequences
[M::worker_pipeline::813.2470.70] mapped 35716 sequences
[M::worker_pipeline::831.9910.69] mapped 35745 sequences
[M::worker_pipeline::850.6310.68] mapped 35697 sequences
[M::worker_pipeline::852.3510.68] mapped 3352 sequences
[M::main] Version: 2.21-r1071
[M::main] CMD: minimap2 -x map-pb -I50g -N 10 -a -t 24 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_90_0.95_gapsFill-formatted.fa 3linear.mj.combined.fa-formatted.fa
[M::main] Real time: 852.383 sec; CPU: 579.848 sec; Peak RSS: 1.031 GB
Scaffolding 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_90_0.95_gapsFill-formatted.fa using 3linear.mj.combined.fa-formatted.fa and filling new gaps with sequences in 3linear.mj.combine
Running RAILS -f 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_90_0.95_gapsFill-formatted.fa -s 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_scaffolding.fof -l 2 -g 250 -d 90 -i 0.95 -a_90_0.95_rails -q 3linear.mj.combined.fa-formatted.fof -p /sw/Modules/QFAB/samtools ...
/scratch/project/qaafi-cnafs/RAILS/bin/runRAILSminimap.sh: line 69: ./RAILS: No such file or directory
RAILS process terminated.

Result files:-----------------------------------------

ll
-rw-r--r-- 1 user S_QAAFI_CNAFS 22606224886 Aug 6 00:13 3linear.mj.combined.fa
-rw-r--r-- 1 user S_QAAFI_CNAFS 22576386886 Aug 6 16:08 3linear.mj.combined.fa-formatted.fa
-rw-r--r-- 1 user S_QAAFI_CNAFS 0 Aug 6 16:55 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_90_0.95_gapsFill-formatted.fa
-rw-r--r-- 1 user S_QAAFI_CNAFS 907738727 Aug 6 00:03 3linear.MJ_hifiasm.fa
-rw-r--r-- 1 user S_QAAFI_CNAFS 907743779 Aug 6 16:08 3linear.MJ_hifiasm.fa-formatted.fa
drwxr-sr-x 2 user S_QAAFI_CNAFS 4096 Jan 5 2021 bin
-rw-r--r-- 1 user S_QAAFI_CNAFS 462 Jan 5 2021 Dockerfile
-rw-r--r-- 1 user S_QAAFI_CNAFS 35064 Jan 5 2021 LICENSE
drwxr-sr-x 10 user S_QAAFI_CNAFS 4096 Aug 5 19:34 minimap2
-rw-r--r-- 1 user S_QAAFI_CNAFS 22981164883 Aug 5 19:32 mj.combined.fa
-rw-r--r-- 1 user S_QAAFI_CNAFS 919084820 Aug 5 19:32 MJ_hifiasm_assembly.fa
drwxr-sr-x 2 user S_QAAFI_CNAFS 4096 Jan 5 2021 paper
-rw-r--r-- 1 user S_QAAFI_CNAFS 24801 Jan 5 2021 rails-logo.png
-rw-r--r-- 1 user S_QAAFI_CNAFS 13715 Jan 5 2021 readme.md
drwxr-sr-x 2 user S_QAAFI_CNAFS 4096 Jan 5 2021 tarball
drwxr-sr-x 10 user S_QAAFI_CNAFS 4096 Jan 5 2021 versions
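Both "No such file or directory" messages in the log refer to `./cobbler.pl` and `./RAILS`, which are relative paths and therefore resolve against the directory the script is launched from. A minimal sketch of a check for that follows; `missing_in_dir` is a hypothetical helper and not part of RAILS.

```shell
# Hypothetical helper: list the named executables that are absent from a directory.
missing_in_dir() {
    dir=$1
    shift
    for name in "$@"; do
        # -x is true only if the file exists and is executable.
        [ -x "$dir/$name" ] || echo "$name"
    done
}

# Names taken from the two error messages in the log; "." is the launch directory.
missing_in_dir . cobbler.pl RAILS
```

An empty result means both files are reachable as `./cobbler.pl` and `./RAILS` from the current directory.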

ardy20 commented 3 years ago

Hi

I installed bwa and added it to my path, and it is running now. I will keep you posted when I get the results.

ardy20 commented 3 years ago

Hi Rene

runRAILSminimap.sh ran without errors, but oddly the number of scaffolds increased. The initial assembly (HiFi reads assembled with hifiasm) had 780 scaffolds, but this increased to 970 after running runRAILSminimap.sh. The N50 also decreased from 46 to 24 Mb. What might the problem be?

warrenlr commented 3 years ago

this is for *rails.scaffolds.fa ? puzzling... I don't see how the scaffolder would yield more scaffolds than the baseline's. Could you post the filenames and their n, L50 and N50 here? Did you also run the human test and what did you get?

(base) [rwarren@hpce704 test]$ abyss-fac -t 500 reads.fa_vs_hsapiens-8.fa_250_0.95_*fa
n       n:500  L50  min  N75      N50      N25      E-size   max      sum      name
966238  65905  145  500  2686274  5667622  9910837  7180150  26.67e6  2.828e9  reads.fa_vs_hsapiens-8.fa_250_0.95_gapsFill.fa
966238  65905  145  500  2686274  5667622  9910837  7180150  26.67e6  2.828e9  reads.fa_vs_hsapiens-8.fa_250_0.95_gapsFill-formatted.fa
958260  63692  109  500  3543494  7320241  14.39e6  9488373  34.78e6  2.838e9  reads.fa_vs_hsapiens-8.fa_250_0.95_rails.scaffolds.fa

ardy20 commented 3 years ago

Hi Rene. Yes, this is for rails.scaffolds. I thought these poor results might be due to the inherent nature of HiFi data or to relaxed minimap2 settings.

I did not run on human test data.

Unfortunately, I deleted the runRAILSminimap.sh result files after seeing the low N50 values, and I am currently testing runRAILS.sh, which uses bwa. It is much slower. I will also test it on my IPA (Improved Phased Assembler) assembly, which has only 300 contigs, and keep you posted. I am keen to get this tool working on plant genomes.

ardy20 commented 3 years ago

Dear Rene

I tested runRAILS.sh, which uses bwa. The input assembly had ~780 scaffolds. The list of files and the QUAST results are as follows:

S_QAAFI_CNAFS 22606224886 Aug 6 00:13 3linear.mj.combined.fa
S_QAAFI_CNAFS 22576386886 Aug 7 10:17 3linear.mj.combined.fa-formatted.fa
S_QAAFI_CNAFS 36 Aug 8 04:12 3linear.mj.combined.fa-formatted.fof
S_QAAFI_CNAFS 907738727 Aug 7 19:15 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_gapsFill.fa
S_QAAFI_CNAFS 907740869 Aug 7 19:15 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_gapsFill-formatted.fa
S_QAAFI_CNAFS 16 Aug 7 19:26 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_gapsFill-formatted.fa.amb
S_QAAFI_CNAFS 39316 Aug 7 19:26 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_gapsFill-formatted.fa.ann
S_QAAFI_CNAFS 907726204 Aug 7 19:26 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_gapsFill-formatted.fa.bwt
S_QAAFI_CNAFS 226931531 Aug 7 19:26 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_gapsFill-formatted.fa.pac
S_QAAFI_CNAFS 453863112 Aug 7 19:30 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_gapsFill-formatted.fa.sa
S_QAAFI_CNAFS 7712 Aug 7 19:15 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_gapsFill-list.tsv
S_QAAFI_CNAFS 569 Aug 7 19:15 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_gapsFill.log
S_QAAFI_CNAFS 1224 Aug 8 04:14 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_rails.log
S_QAAFI_CNAFS 0 Aug 8 04:14 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_rails.pairing_distribution.csv
S_QAAFI_CNAFS 78 Aug 8 04:14 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_rails.pairing_issues
S_QAAFI_CNAFS 28534 Aug 8 04:14 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_rails.scaffolds
S_QAAFI_CNAFS 907774253 Aug 8 04:14 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_rails.scaffolds.fa
S_QAAFI_CNAFS 0 Aug 8 04:14 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_250_0.95_rails.scaffolds_GAPseqList.txt
S_QAAFI_CNAFS 10666421551 Aug 7 19:14 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_gapfilling.bam
S_QAAFI_CNAFS 46 Aug 7 06:44 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_gapfilling.bam.bampreprocessor.err.log266451628282667
S_QAAFI_CNAFS 46 Aug 7 19:15 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_gapfilling.bam.bampreprocessor.err.log289951628327738
S_QAAFI_CNAFS 63 Aug 7 19:14 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_gapfilling.fof
S_QAAFI_CNAFS 10665617309 Aug 8 04:12 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_scaffolding.bam
S_QAAFI_CNAFS 46 Aug 8 04:14 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_scaffolding.bam.bampreprocessor.err.log192851628360056
S_QAAFI_CNAFS 64 Aug 8 04:12 3linear.mj.combined.fa_vs_3linear.MJ_hifiasm.fa_scaffolding.fof
S_QAAFI_CNAFS 907738727 Aug 6 00:03 3linear.MJ_hifiasm.fa
S_QAAFI_CNAFS 907743779 Aug 7 10:16 3linear.MJ_hifiasm.fa-formatted.fa
S_QAAFI_CNAFS 16 Aug 7 10:28 3linear.MJ_hifiasm.fa-formatted.fa.amb
S_QAAFI_CNAFS 41256 Aug 7 10:28 3linear.MJ_hifiasm.fa-formatted.fa.ann
S_QAAFI_CNAFS 907726204 Aug 7 10:28 3linear.MJ_hifiasm.fa-formatted.fa.bwt
S_QAAFI_CNAFS 226931531 Aug 7 10:28 3linear.MJ_hifiasm.fa-formatted.fa.pac
S_QAAFI_CNAFS 453863112 Aug 7 10:32 3linear.MJ_hifiasm.fa-formatted.fa.sa

QUAST
Assembly 3linear_mj_combined_fa_vs_3linear_MJ_hifiasm_fa_250_0_95_rails_scaffolds_fa
contigs (>= 0 bp) 970
contigs (>= 1000 bp) 970
Total length (>= 0 bp) 907726117
Total length (>= 1000 bp) 907726117
contigs 970
Largest contig 49863231
Total length 907726117
GC (%) 40.06
N50 24506025
N75 13795436
L50 13
L75 26
N's per 100 kbp 0.00

ardy20 commented 3 years ago

Running runRAILSminimap.sh on my IPA assembly produced the following error. The soft link did not work.

Running: ./RAILS [v1.5.1] -f joF2.combined.fa_vs_JoF2.line.asm.fa_90_0.95_gapsFill-formatted.fa -q joF2.combined.fa-formatted.fof -s joF2.combined.fa_vs_JoF2.line.asm.fa_scaffolding.fof -d 90 -i 0.95 -e 1 -l 2 -a 0.99 -g 250 -t

=>Reading bam: Sun Aug 8 17:06:31 AEST 2021
/scratch/project/qaafi-cnafs/RAILS/bin/runRAILSminimap.sh: line 71: 9124 Killed ./RAILS -f $2_vs_$1_$3_$4_gapsFill-formatted.fa -s $2_vs_$1_scaffolding.fof -l $6 -g $5 -d $3 -i $4 -b $2_vs_$1_$3_$4_rails -q $2-formatted.fof -p $8
RAILS process terminated.

Running RAILS again:

./RAILS -f $2_vs_$1_$3_$4_gapsFill-formatted.fa -s $2_vs_$1_scaffolding.fof -l $6 -g $5 -d $3 -i $4 -b $2_vs_$1_$3_$4_rails -q $2-formatted.fof -p $8
Invalid file: _vs____gapsFill-formatted.fa -- fatal
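The "Invalid file" message is what one would expect when the command is pasted with its positional parameters left unexpanded: in an interactive shell `$1` to `$8` are normally empty, so only the literal parts of the filename remain. The sketch below illustrates this; the pattern `$2_vs_$1_$3_$4_gapsFill-formatted.fa` is inferred from the error message and from the output filenames earlier in this thread, not copied from the script, and `gapsfill_name` is a hypothetical helper.

```shell
# Sketch only: the filename pattern is inferred from this thread, not taken from RAILS.
gapsfill_name() {
    # $1 = assembly, $2 = long sequences, $3 = anchoring length, $4 = min identity
    echo "$2_vs_$1_$3_$4_gapsFill-formatted.fa"
}

# With the values from the failed run:
gapsfill_name JoF2.line.asm.fa joF2.combined.fa 90 0.95
# -> joF2.combined.fa_vs_JoF2.line.asm.fa_90_0.95_gapsFill-formatted.fa

# With empty parameters, as when the line is pasted into an interactive shell:
gapsfill_name "" "" "" ""
# -> _vs____gapsFill-formatted.fa
```

Re-running the step by hand therefore needs the actual filenames and values substituted for each `$N`.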

warrenlr commented 3 years ago

Hi Ardy,

The human test data provided is useful for a few reasons:

  1. to test that your installation works
  2. to ensure the results are consistent with what is expected, but on your system. It also provides users with a protocol for running the software and is easily replicated by swapping the files for yours.

I still don't understand why there are more scaffolds after scaffolding than you began with*; it could be due to a formatting issue with your input files. I say this especially because the tool has been used in a number of third-party studies/publications since 2016, and this is the first report of such strange behaviour. For that reason, I think it is imperative that you run the test provided and confirm that you are getting the same results as I posted. Next, I would suggest you ensure that your files are formatted for use on a unix platform (DOS newlines may cause issues that are sometimes difficult to diagnose). We find seqtk useful for fasta/fastq file formatting. I think this calls for systematic troubleshooting to identify the root cause of this on your particular system/setup.

*for instance, it may be worth tracking where these additional scaffolds come from, etc.
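A cheap first step in that tracking is to count records and compute N50 for every file in the chain (original assembly, formatted input, gapsFill output, rails.scaffolds.fa). For instance, the minimap2 log earlier in this thread reports 970 target sequences for the formatted input assembly. The helpers below are a hypothetical sketch using only grep, awk and sort; they are not part of RAILS and are no substitute for abyss-fac or QUAST.

```shell
# Hypothetical helpers for comparing an assembly before and after scaffolding.
count_records() {
    # Every FASTA record starts with a ">" header line.
    grep -c '^>' "$1"
}

n50() {
    # Sum the sequence length of each record, sort lengths in descending order,
    # then print the length at which the running total reaches half the assembly size.
    awk '/^>/ { if (len) print len; len = 0; next }
         { sub(/\r$/, ""); len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END { for (i = 1; i <= NR; i++) { run += lens[i]; if (run * 2 >= total) { print lens[i]; exit } } }'
}

# Tiny demonstration file with record lengths 8, 6, 4 and 2.
demo=$(mktemp)
printf '>a\nAAAAAAAA\n>b\nCCC\nCCC\n>c\nGGGG\n>d\nTT\n' > "$demo"
count_records "$demo"   # prints 4
n50 "$demo"             # prints 6
rm -f "$demo"
```

Running both helpers on each file in turn shows at which step, if any, the number of sequences changes.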

good luck Rene

ardy20 commented 3 years ago

Hi Rene

Thanks for the suggestions. I am currently running my IPA assembly (with HiFi reads). Once I have the results, I will run the human test. I used the following commands to linearise my input files:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa > out.fa

This creates an empty line at the beginning that should be removed with:

tail -n +2 filein.fa > fileout.fa
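The two steps can also be combined so that no leading empty line is produced, and DOS carriage returns can be stripped in the same pass. This is a minimal sketch with a hypothetical helper name, `linearize_fasta`; it is not part of RAILS.

```shell
# Hypothetical helper (not part of RAILS): one header line, then one sequence line, per record.
linearize_fasta() {
    awk '{ sub(/\r$/, "") }    # strip a trailing DOS carriage return, if any
         /^>/ { if (inseq) printf "\n"; print; inseq = 0; next }
         { printf "%s", $0; if (length($0)) inseq = 1 }
         END { if (inseq) printf "\n" }' "$1"
}

# Tiny demonstration: a wrapped, DOS-formatted file with two records.
demo=$(mktemp)
printf '>a\r\nAC\r\nGT\r\n>b\r\nTT\r\n' > "$demo"
linearize_fasta "$demo"
rm -f "$demo"
```

For the demonstration file this prints the header `>a`, the joined sequence `ACGT`, then `>b` and `TT`, with no empty first line.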

We use seqtk widely in our lab, and it is worth trying in this case.

I thought the problem might be linked to the hifi reads (ccs.fasta) that I use for gapfilling. This is the same file that I used for the assembly. I thought it would be good if I use a separate dataset for this purpose, for example CLR or ONT, with longer read lengths (<35 Kb).

I will keep you posted.

Regards Ardy