bcgsc / goldrush

Linear-time de novo Long Read Assembler
GNU General Public License v3.0

Unable to run simulated human HiFi reads #138

Open Oieswarya opened 1 week ago

Oieswarya commented 1 week ago

Hello, I have been trying to run GoldRush on simulated human HiFi reads at 10x coverage. I have used GoldRush on several other simulated inputs and it ran successfully. I also checked my FASTQ file for non-ACGT characters and found none.
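For readers who want to repeat that kind of check, here is a minimal sketch. It assumes a plain 4-line-per-record FASTQ; the file and its two records are made up for illustration and are not the data from this issue.

```shell
# Build a tiny FASTQ in a temporary directory (made-up records, illustration only).
tmpdir=$(mktemp -d)
fq="$tmpdir/example.fq"
printf '@1\nACGTACGT\n+\nIIIIIIII\n@2\nACGTNNGT\n+\nIIIIIIII\n' > "$fq"

# In a 4-line FASTQ record the sequence is line 2, i.e. NR % 4 == 2.
# Count the sequence lines that contain anything other than A, C, G or T.
bad=$(awk 'NR % 4 == 2 && /[^ACGTacgt]/ { n++ } END { print n + 0 }' "$fq")
echo "sequence lines with non-ACGT characters: $bad"

rm -rf "$tmpdir"
```

A count of 0 on a real file means every sequence line contains only A, C, G, and T (in either case).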

I have used this command: goldrush run reads=Human_nonACTG_fq G=3120e6 track_time=1 m=10000 --debug

This is the .out file:

GNU Make 4.3
Built for x86_64-conda-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Reading makefiles...
Updating makefiles....
Updating goal targets....
File 'run' does not exist.
Must remake target 'run'.
mkdir -p goldrush_intermediate_files
cd goldrush_intermediate_files && ln -sf ../Human_nonACTG_fq.fq && goldrush run-in-dir reads=Human_nonACTG_fq G=3120e6 t=48 z=1000 track_time=1 k=22 w=16 tile=1000 b=10 u=5 a=1 o=0.1 x=10 h=3 s=1011011110110111101101 m=10000 M=5 r=0.9 P=15 d=5 span=2 dist=500 k_ntLink=40 w_ntLink=250 rounds=5 polisher=goldpolish polisher_mapper=minimap2 shared_mem=/dev/shm
GNU Make 4.3
Built for x86_64-conda-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Reading makefiles...
make[1]: Entering directory '/home/goldrush_intermediate_files'
Updating makefiles....
Updating goal targets....
File 'run-in-dir' does not exist.
File 'check-G' does not exist.
Must remake target 'check-G'.
Successfully remade target file 'check-G'.
File 'check-reads' does not exist.
Must remake target 'check-reads'.
Successfully remade target file 'check-reads'.
File 'clean' does not exist.
File 'goldrush_asm_golden_path.fa' does not exist.
File 'goldrush_asm_silver_path_all.fq' does not exist.
File 'goldrush_asm_silver_path_5.fq' does not exist.
Must remake target 'goldrush_asm_silver_path_5.fq'.
command time -v -o goldrush_asm_silver_path_5.fq.time goldrush-path -k 22 -w 16 -t 1000 -u 5 -a 1 -o 0.1 -p goldrush_asm_silver_path -i Human_nonACTG_fq.fq -h 3 -j 48 -x10 -P 15 -d 5 -s 1011011110110111101101 -g 3120e6 -b 10 -r 0.9 --silver_path -M 5 -m 10000 --verbose
make[1]: Leaving directory '/home/goldrush_intermediate_files'

This is the .err file:

make[1]: [/home/.conda/envs/goldrush_env/bin/goldrush.make:251: goldrush_asm_silver_path_5.fq] Error 127
make: [/home/.conda/envs/goldrush_env/bin/goldrush.make:203: run] Error 2

Could you kindly guide me as to where I am going wrong?

lcoombe commented 1 week ago

Hi @Oieswarya,

Is that the full standard out and error? After goldrush-path starts, there will be some messages about the parameters, etc., and I don't see those there.

Can you confirm you are using exactly the same environment and installation as past runs? Do you see the help page when you run goldrush-path --help?

Thank you for your interest in GoldRush!
Lauren
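One thing that may be worth ruling out alongside the --help check: in POSIX shells, exit status 127 conventionally means a command could not be found. The failing recipe line starts two external programs, time (via "command time -v") and goldrush-path, so either one being absent from PATH inside the job environment could in principle produce "Error 127". This is a guess at a possible cause, not a confirmed diagnosis. The sketch below shows the 127 convention and a PATH search; the find_on_path helper and the bogus command name are hypothetical and exist only for this illustration.

```shell
# Exit status 127 is what a POSIX shell reports when a command cannot be found.
# The name below is deliberately bogus (hypothetical, illustration only).
status=0
definitely-not-a-real-command-xyz >/dev/null 2>&1 || status=$?
echo "exit status for a missing command: $status"

# Search PATH for an executable file, ignoring shell keywords and builtins.
# ("time" is a keyword in bash, so "command -v time" alone does not prove
# that an external time binary exists.)
find_on_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        candidate="${dir:-.}/$name"
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            IFS=$old_ifs
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# The failing recipe line starts two external programs: time and goldrush-path.
for tool in time goldrush-path; do
    if path=$(find_on_path "$tool"); then
        echo "found:   $tool -> $path"
    else
        echo "missing: $tool"
    fi
done
```

Running this inside the same job script as the failing run, rather than on a login node, would show what the batch environment actually sees.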

th-of commented 1 week ago

I'm also having a similar issue trying to run goldrush. On Ubuntu 20.04, Ubuntu 22.04, and WSL2, all machines show the same error when goldrush is installed with conda.

ln: ./..: cannot overwrite directory
make: *** [/home/thomas-ws/miniconda3/bin/goldrush.make:203: run] Error 1

I haven't been able to build it from source yet because of missing shared libraries that I can't figure out.

lcoombe commented 1 week ago

Hi @th-of,

This looks like a different error/issue - would you mind opening a new GitHub issue so we can keep our discussions separate? In particular, we would want to see your command and full log (standard out and error), as well as the result of running our assembly demo.

th-of commented 1 week ago

As I was reproducing my issue I found the problem: I was including the file extension in the reads name ("reads=reads.fastq"). However, that only swapped one problem for another (see below). I will spend some more time trying to fix it before I open an issue for this one.

SeqIndex::SeqIndex: Loading index from some/path/reads.fastq.index
terminate called after throwing an instance of 'std::invalid_argument'
what(): stoul
goldrush/bin/goldrush.make:259: goldrush_asm_golden_path.goldpolish-polished.fa] Error 143
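On the file-extension point: the make log at the top of this thread shows the wrapper linking ../Human_nonACTG_fq.fq for reads=Human_nonACTG_fq, which suggests the reads parameter is a prefix and the extension is added by the pipeline. A minimal sketch of deriving that prefix, assuming the file name quoted above; the command is only printed, and the G value is a placeholder:

```shell
# File name as quoted in the comment above (illustration only).
reads_file="reads.fastq"

# Strip the final extension to get the prefix passed as reads=...
reads_prefix=${reads_file%.*}

# Print the command rather than running it (the G value is a placeholder).
echo "goldrush run reads=$reads_prefix G=<genome size>"
```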

lcoombe commented 1 week ago

Sounds good, @th-of. I'll look out for your fresh issue. It's hard to say too much without more information from you, but just a reminder to test your installation using the assembly demo, and ensure that your input read file is in your current working directory.

th-of commented 1 week ago

All is working now! However, one of the steps in the goldrush pipeline (goldpolish?) appears to be incompatible with FASTQ files from Dorado. The formatting of the header line seems to be the problem. A FASTQ file with the following header causes an error:

@b4fe3a55-f963-4a43-88d1-35b23acdbdc7 st:Z:2024-03-14T09:22:04.350+00:00 RG:Z:ccf17720be1a9a9f8f33443ea90c42b6a7685e7f_dna_r10.4.1_e8.2_400bps_hac@v5.0.0 DS:Z:gpu:NVIDIA GeForce RTX 3090

If I rename all the headers in the FASTQ file to a single word, it runs without problems. The parser probably doesn't account for a tab-separated list in a FASTQ header. Dorado generates this by default when basecalling ONT pod5 files to FASTQ.
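One possible way to apply that workaround is to keep only the read ID (the first whitespace-delimited field) on each header line. The sketch below assumes a plain 4-line-per-record FASTQ; the record and file names are made up for illustration, and it has not been confirmed which pipeline step rejects the longer header.

```shell
tmpdir=$(mktemp -d)
in_fq="$tmpdir/dorado_like.fq"
out_fq="$tmpdir/renamed.fq"

# One made-up record whose header carries tab-separated tags after the read ID.
printf '@read-0001\tst:Z:2024-03-14T09:22:04.350+00:00\tDS:Z:gpu:example\nACGT\n+\nIIII\n' > "$in_fq"

# Header lines are every 4th line starting at line 1. Quality lines can also
# begin with "@", so select by position rather than by pattern.
awk 'NR % 4 == 1 { print $1; next } { print }' "$in_fq" > "$out_fq"

first_header=$(head -n 1 "$out_fq")
echo "$first_header"
rm -rf "$tmpdir"
```

Keeping the original ID, rather than renumbering, makes it possible to trace reads back to the basecaller output later.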

lcoombe commented 1 week ago

Glad it's working for you now! Huh, strange. We've tested reads from Dorado before, but perhaps not with this header format. Thanks for that info, we'll take a look at fixing that.

Oieswarya commented 1 week ago

Hi @lcoombe, yes, strangely I have not changed anything, and goldrush-path --help gives me all the information from the help page. I am using the same job script that I used to submit my previous jobs. I am using 370 GB of memory, which should be more than enough, but do you think it could be a memory issue?

I also checked the headers of my fastq file and they are single words like @1 and so on.

lcoombe commented 1 week ago

Thanks for confirming, @Oieswarya!

Looking at your command, your target genome looks to be ~3Gbp, so yes that should be enough memory.

In the goldrush_intermediate_files directory, could you try directly running the command that looks to have failed?

command time -v -o goldrush_asm_silver_path_5.fq.time goldrush-path -k 22 -w 16 -t 1000 -u 5 -a 1 -o 0.1 -p goldrush_asm_silver_path -i Human_nonACTG_fq.fq -h 3 -j 48 -x10 -P 15 -d 5 -s 1011011110110111101101 -g 3120e6 -b 10 -r 0.9 --silver_path -M 5 -m 10000 --verbose

It would be super helpful to get more log messages from that command. It would be strange for it to fail immediately without writing any of its regular messages to the log if the binary itself seems OK (as indicated by you seeing the help page just fine).

Oieswarya commented 1 week ago

@lcoombe I wanted to update you. I ran the command separately from goldrush_intermediate_files, and this is the log file it generated:

Using preset spaced seed with:
span: 22
weight: 16
Calculating 5 silver path(s)
Using:
tile length: 1000
block size: 10
seed patterns: 3
threshold: 10
base seed pattern: 1011011110110111101101
minimum unassigned tiles: 5
maximum assigned tiles: 1
expected hash space: 6442450944
minimum average phred quality score: 15
maximum average phred delta between first and second half of read: 5
occupancy: 0.1
jobs: 48
allocating bit vector
m_filterSize: 61146729472
finished allocating bit vector in 2.2652
opening: Human_nonACTG_fq.fq
inserting bit vector
num_passed_reads: 1597712
num_reads: 3083048
num_reads - num_passed_reads: 1485336
num_reads - num_passed_reads / num_reads: 0.0000
num_reads_skipped_by_phred: 0
num_reads_skipped_by_delta: 0
num_reads_skipped_by_length: 1485336
Total reads skipped: 1485336
finished inserting bit vector in 1391.7048
assigning tiles
processed 10000 reads
processed 20000 reads
[...]
processed 1230000 reads
Visited 642632 reads to generate 1 silver paths
processed 1240000 reads
[...]
processed 2560000 reads
Visited 1330621 reads to generate 2 silver paths
processed 2570000 reads
[...]
processed 3080000 reads
WARNING: Expected 5 silver paths, but only 3 generated. Possible reasons include:
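As a side note on the counts in the log above: the printed ratio of 0.0000 does not match the numbers next to it. Recomputing from the logged counts gives the fraction of reads skipped by the length filter (with m=10000 in the command, presumably the minimum read length). The sketch uses only values copied from the log; why the log prints 0.0000 is not established here.

```shell
# Counts copied from the log above.
num_reads=3083048
num_passed_reads=1597712
skipped_by_length=1485336

# Sanity check: reads that passed plus reads skipped should equal the total.
total=$((num_passed_reads + skipped_by_length))
echo "passed + skipped = $total"

# Fraction of reads skipped, using floating-point division in awk.
fraction=$(awk -v s="$skipped_by_length" -v n="$num_reads" 'BEGIN { printf "%.4f", s / n }')
echo "fraction skipped = $fraction"
```

So roughly 48% of the input reads were excluded by length before tile assignment, which may be relevant to the silver-path warning at the end of the log.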

lcoombe commented 1 week ago

Hi @Oieswarya,

So that indicates that the run went just fine (I don't see any errors), so I'm unsure why you were getting that error before. You could try re-launching the same command, which should now start after that goldrush-path step. You can confirm that by running the same command with the dry-run option (-n).
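Concretely, the dry run suggested above would be the original command from this thread with -n appended. A minimal sketch that runs it only if goldrush is on PATH and otherwise just prints it:

```shell
# The original invocation from this thread, with the dry-run flag appended.
dry_run_cmd="goldrush run reads=Human_nonACTG_fq G=3120e6 track_time=1 m=10000 -n"

if command -v goldrush >/dev/null 2>&1; then
    # Word splitting is intentional: the string holds a simple command line.
    $dry_run_cmd
else
    echo "goldrush not on PATH; would run: $dry_run_cmd"
fi
```

If the goldrush-path outputs are already in place, the printed plan should no longer include the goldrush-path command.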

Oieswarya commented 1 week ago

@lcoombe shall I run this command now?

goldrush run reads=Human_nonACTG_fq G=3120e6 track_time=1 m=10000 --debug

lcoombe commented 1 week ago

That's right! Fingers crossed it'll work - if so, it could have been a transient server issue.

Oieswarya commented 1 week ago

I am still getting the same error:

GNU Make 4.3
Built for x86_64-conda-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Reading makefiles...
Updating makefiles....
Updating goal targets....
File 'run' does not exist.
Must remake target 'run'.
mkdir -p goldrush_intermediate_files
cd goldrush_intermediate_files && ln -sf ../Human_nonACTG_fq.fq && goldrush run-in-dir reads=Human_nonACTG_fq G=3120e6 t=48 z=1000 track_time=1 k=22 w=16 tile=1000 b=10 u=5 a=1 o=0.1 x=10 h=3 s=1011011110110111101101 m=10000 M=5 r=0.9 P=15 d=5 span=2 dist=500 k_ntLink=40 w_ntLink=250 rounds=5 polisher=goldpolish polisher_mapper=minimap2 shared_mem=/dev/shm
GNU Make 4.3
Built for x86_64-conda-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Reading makefiles...
make[1]: Entering directory '/home/goldrush_intermediate_files'
Updating makefiles....
Updating goal targets....
File 'run-in-dir' does not exist.
File 'check-G' does not exist.
Must remake target 'check-G'.
Successfully remade target file 'check-G'.
File 'check-reads' does not exist.
Must remake target 'check-reads'.
Successfully remade target file 'check-reads'.
File 'clean' does not exist.
File 'goldrush_asm_golden_path.fa' does not exist.
File 'goldrush_asm_silver_path_all.fq' does not exist.
File 'goldrush_asm_silver_path_5.fq' does not exist.
Must remake target 'goldrush_asm_silver_path_5.fq'.
command time -v -o goldrush_asm_silver_path_5.fq.time goldrush-path -k 22 -w 16 -t 1000 -u 5 -a 1 -o 0.1 -p goldrush_asm_silver_path -i Human_nonACTG_fq.fq -h 3 -j 48 -x10 -P 15 -d 5 -s 1011011110110111101101 -g 3120e6 -b 10 -r 0.9 --silver_path -M 5 -m 10000 --verbose
make[1]: Leaving directory '/home/goldrush_intermediate_files'

When I ran the goldrush-path command, it appeared to run, but I did not see any output files in the folder, nor any of the soft links it usually produces.

lcoombe commented 1 week ago

Are you running that command in the same folder? It doesn't appear to be starting in the right place (i.e., it is re-running the goldrush-path command). Regardless, I can't really see any error there. Could you attach the full log files to GitHub?

In addition, could you re-run a fresh demo with your current set-up, just to make sure that nothing happened with the environment or server that you're using?

Oieswarya commented 1 week ago

Yes, I am running both commands from the same environment where I installed goldrush.

I will try to upload the file, but I am unsure if I can, as it is a 61 GB file.

I will run goldrush with another file (which previously successfully ran) and see if there is something wrong with the installation somehow.

Thank you for your prompt responses!

lcoombe commented 1 week ago

No worries!

Just to clarify - I was asking about you sharing your full log files from the failed run, not your reads :)

And I know you're using the same environment, but always good to do that fresh demo run as a sanity check - running a previously successful read set works too!
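For anyone collecting logs for a report like this, a minimal sketch of capturing standard out and standard error into separate files along with the exit status. The run_pipeline function and the log file names are hypothetical stand-ins for the real goldrush call.

```shell
# Stand-in for the real pipeline command (hypothetical; replace with your goldrush call).
run_pipeline() {
    echo "progress message on stdout"
    echo "error message on stderr" >&2
    return 2
}

log_dir=$(mktemp -d)
status=0
# Send stdout and stderr to separate files and remember the exit status.
run_pipeline > "$log_dir/goldrush.out" 2> "$log_dir/goldrush.err" || status=$?

echo "exit status: $status"
echo "stdout log: $log_dir/goldrush.out"
echo "stderr log: $log_dir/goldrush.err"
```

Both files are plain text and usually small enough to attach to an issue directly.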