fmalmeida / MpGAP

Multi-platform genome assembly pipeline for Illumina, Nanopore and PacBio reads
https://mpgap.readthedocs.io/en/latest/
GNU General Public License v3.0

quast generating empty files #38

Closed josruirod closed 1 year ago

josruirod commented 1 year ago

So, I've consistently observed that the quast step fails, and even though the pipeline points to the work directory so the files can be checked, these files appear to be empty. This is the folder: quast_folder

I'm attaching the logs and the files that were not 0-sized. I will let you know if it keeps failing during my tests, and whether the fix you provided in the config file allows me to bypass the error and avoid the pipeline crash. Thank you so much. nextflow.log.txt output.log.txt quast_files.zip

fmalmeida commented 1 year ago

From the logs, it seems that you're trying to run without selecting any of the available configuration profiles, so the pipeline is trying to load and execute the tools from your machine, whereas it is written to read them from the pre-built docker/singularity/conda profiles.

Because of that, lots of tools are not being found, and the pipeline is failing both in the QC steps and in some assemblers.

Take a look here to select the profile that best suits your needs: https://github.com/fmalmeida/MpGAP/tree/master#selecting-between-profiles

josruirod commented 1 year ago

Silly me. There was even a warning in the log. Thanks! So it seems to be running with no errors now... I will let you know if it finishes. Cool! Thanks for the kind support.

josruirod commented 1 year ago

Hi, a little update: I've been trying to run the pipeline including the profile as you mentioned (and as clearly stated in the instructions, sorry about that). I settled for the singularity profile because singularity is installed system-wide on our HPC (I tried docker with errors, probably because it is not installed on my HPC, and tried conda with errors too, although I'd say I followed the instructions).

Anyway, with singularity it looked good at the beginning: after ~7 hours running (slurm job), wtdbg2, raven and flye seem to have run. But it has also accumulated errors, apparently in quast definitively after 3 attempts (the errors seem to have been ignored though, thanks to the code I added to the conf), and in unicycler on the first attempt. Could you maybe already take a look to see how we can avoid the errors? I've tried to gather the relevant logs in the attached zip; please let me know if I've missed any or if you need anything else. It may be a mess, so please tell me if you have any suggestion on how to better share the logs or the results. Thanks!! gith.zip

fmalmeida commented 1 year ago

Hi @josruirod, I think it could be due to memory limitations in each process. By default, they ask for very little, and the ignore error strategy may be interfering when they try to ask for more after the first error.

By default, the assemblers first try to run with 6 CPUs and 14 GB of memory, and if that fails, they retry with the maximum values you've set with --max_cpus and --max_memory. But maybe the ignore strategy is interfering, making the pipeline not try again with more resources after the first failure.
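
For reference, the general Nextflow pattern behind this escalation looks roughly like the sketch below. It is a simplified illustration of the idea, not the pipeline's exact base.config:

process {
    withLabel: process_assembly {
        // retry once on failure, scaling from the modest first-attempt
        // resources up to the user-provided maximums
        errorStrategy = 'retry'
        maxRetries    = 1
        cpus          = { task.attempt == 1 ? 6 : params.max_cpus }
        memory        = { task.attempt == 1 ? 14.GB : params.max_memory }
    }
}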

But I am not sure. Many of the directories in the zip file you sent are empty. Make sure to use -r when zipping so it is zipped recursively. Also, can you send me the .nextflow.log file together with this new zip so I can take a look?

About the profiles:

In general, docker is not installed on HPCs, and singularity is the way to go. The conda profile is very tricky to use and in general I'd say to avoid it. I know it works, but it is very tricky to get working.

In the meantime, while you send me these files and I take a look to understand, please try to run again, making sure to:

  1. Use the params-issue branch
  2. Do not tell the pipeline to ignore the errors (so we can better see what's causing it later)
  3. Use singularity profile

This may help us find the source. I don't think the error is in the pipeline itself, but maybe in the profile or in this resource allocation 😄

josruirod commented 1 year ago

Hi, thank you as always for your time and the fast support! Weird about the empty files in the zip. I double-checked and they are filled with the files... including the hidden .nextflow.log. I tried again with this tar.gz; does it look good? The folders contain files starting with ".", so maybe they were hidden? Based on the logs, I can confirm this was the params-issue branch and the singularity profile. I just relaunched without the ignore and will let you know.

gith.tar.gz

Thanks.

fmalmeida commented 1 year ago

Now it worked, and I found something:

Process exceeded running time limit (1h)

This is in your .nextflow.log, so most likely the heavy processes were killed due to running time. You can already kill the last run I asked you to do, and add a new parameter so we can check whether this is the source of the problems.

Try setting --max_time to something bigger, let's say 7 h (I cannot guide you much on this ... it would require knowing the size of your dataset and guessing how much time it needs).

The parameter is set as: --max_time '7.h'.

Remember not to ignore the errors, at least for now, so we can do proper debugging.

Hope this is what's needed 😄

josruirod commented 1 year ago

Got it, but the config already included for this run:

// Max resource options
// Defaults only, expecting to be overwritten
max_memory = '240.GB'
max_cpus   = 60
max_time   = '72.h'

That's the parameter you were referring to, right? It was already 72 h (and it did not run for 72 hours before failing). I see it's recorded in the nextflow.log too, so that shouldn't be it? I doubled it just in case; maybe it's the sum across all the processes? I'm using a slurm job limited to 48 hours overall. Maybe the fact that this is not the same as the max_time parameter is causing a conflict?

I spotted somewhere in the logs the error:

tput: No value for $TERM and no -T specified

Probably totally unrelated, but just in case: could it be sending something to the background and not setting the time right? Something similar was happening with canu, and that's why I turned the grid off.

Anyway, it's running again with errors not ignored, so maybe when it fails again we can see if there's any more info. I'm running both hybrid strategies too, so let's see.

fmalmeida commented 1 year ago

Interesting. Yes, this is the parameter I was referring to. And yes, I see it now in the log as well. I may have gone through it too fast haha, sorry 😅

About the slurm config you mentioned: I think that if a conflict had happened, it may not even have been launched. But, in general, if you're not allowed to run for more than 48.h, then setting 72.h will probably not override that (I think).

This tput is more of a warning than an error. I don't think it has anything to do with it.

Good idea to run both! I hope not ignoring the errors helps us get more info to debug. But the way you're running now seems to already have everything set.

Fingers crossed on this run 🤞🏼

josruirod commented 1 year ago

So with this run I already have errors in unicycler and quast... Regarding quast, I see again the following:

Process exceeded running time limit (1h)

So maybe the max_time argument is not working? I've attached what I think are the relevant logs for you to check when you have the time, thanks! gith.tar.gz

fmalmeida commented 1 year ago

Is it still running? As I explained for the assemblers, quast also tries to run first with low resources, and then retries with full power if the first attempt fails.

But maybe I am telling them to start with too little. I can try to increase it, but I need to check whether this is really the issue.

If the pipeline is still running, then it will still retry the ones that failed, because their first try is with low resources.

fmalmeida commented 1 year ago

I see. It seems that quast is not set up to retry in the way I described. I will quickly change the config on the params-issue branch, then you can try again, making sure to use -latest to pull all changes.

I'll let you know when I commit.

josruirod commented 1 year ago

That's great, thanks for being so available! I was reading here, and I then checked out the base.config in your github, and I saw it's there where the 1 h time is set for the low and ultralow processes... So that's the reason for the first failure? I understand the reasoning, but isn't 1 hour too low? This assembly is around 20-25 Mb, not that big. Could I modify this manually, for example by adding something like this to the config I'm giving as input?

process {
    cpus   = 36
    memory = 256.GB
    time   = 500.h
}

Anyhow, if, as you mention, it should restart with full power and more time, then I guess it will get done on the second attempt.

Waiting for the commit then, thanks!

fmalmeida commented 1 year ago

Yes, all these parameters can be manually modified by the user. If you take my configs, overwrite the resource definitions (making sure to rewrite the withName and withLabel definitions) and pass the file with -c, nextflow will use yours and not mine.

That being said, the ones I set in the configs are just sensible defaults so that it can run in most cases and we can get the most out of parallelization 😄

But I agree, 1 h is too low, and I changed that. And it was good that you encountered these errors, because in your logs I could see that, even though I intended it to, quast was not being retried with more power. It was giving up on the first failure. That may be the case ... I hope so 🤞🏼

I just committed. You can try it with this new config (with more resource allocation on the first try, and making sure that quast retries).

Thank you for reporting this, and for your patience with the feedback and troubleshooting.

About your comment:

Anyhow, if, as you mention, it should restart with full power and more time, then I guess it will get done on the second attempt.

Yes. On the second try it should go full power. At least that's what I tried to set up, but maybe it was not working for quast :)

josruirod commented 1 year ago

Super! Running again, I will let you know. Thank you for being so available! Happy to help.

fmalmeida commented 1 year ago

One note: remember to use -resume when re-running to avoid redoing successful processes :)

josruirod commented 1 year ago

Oh, noted. Indeed, I was not doing that; I'll do it in the future. The good news is quast was OK, 3 out of 4, and I guess the last one would have been too, but I see errors in canu and pilon (all the tries) and in unicycler (first try).

It's still running to finish the remaining things, but could you maybe already take a look? I don't understand the ones regarding canu and pilon (something to do with my HPC file system or mounting?). gith.tar.gz

Hope you can provide any insights, thanks!

fmalmeida commented 1 year ago

Hi @josruirod,

Going step by step.

  1. -resume will resume everything that has not been successful: failed processes and those that were not started.
  2. The unicycler error is OK. I checked the log and it is related to the first execution; it was killed and will probably be launched again with full power.
  3. The canu and pilon ones are actually interesting ... In all the logs, whether stdout, stderr, or the tools' own logs, for both tools and all executions, something odd seems to have happened in your filesystem. Maybe a quick / sudden disconnection, etc., but all of them contain the following message:

Error, do this: mount -t proc proc /proc /dev/fd/62: No such file or directory

Which makes me think that /proc may have been suddenly / briefly unmounted, or something like that. Then the directories and everything else were not accessible anymore, causing them to fail.

Would be good to:

P.S. I am also running it here on my cluster with a random dataset to make sure it is nothing to do with the pipeline or the singularity profile.

josruirod commented 1 year ago

Hi there, sorry it took me a while to get back; I was trying to sort this out. So indeed, the test run you suggest in the manual's quickstart (I should have started with that...) is working perfectly. So yes, it has to be something related to my system. By exporting the environment variables (NXF_SINGULARITY_LIBRARYDIR, NXF_SINGULARITY_CACHEDIR, SINGULARITY_TMPDIR, SINGULARITY_WORKDIR...) it seems I've made some progress, and got canu working, for example. It seems to be a matter of the load on the HPC, or whether I'm running multiple samples at the same time, because it sometimes fails. My HPC IT guys told me the error is in the filesystem the job is trying to use. On the nodes where I was trying to run the jobs, the root dir filled up to 100%.

The problem seems to be that the default folder for docker/singularity is /tmp/ (guessing). The HPC uses another directory for tmp files, a scratch area ($TMPDIR or $LOCAL_SCRATCH), so I have to try to change that. The singularity environment variables should allow me to change the directory, right?
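
For example, I guess something like this in the config I pass with -c could at least move the Nextflow singularity image cache to the scratch area (the path here is just a placeholder for our scratch directory, and I understand it is the same thing the NXF_SINGULARITY_CACHEDIR variable controls):

singularity {
    // hypothetical path; point the image cache at the scratch area instead of the default
    cacheDir = '/path/to/scratch/singularity_cache'
}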

Anyway, if you have any comments I would be grateful, but since the test run in the quickstart is working perfectly, and you've already spent quite some time on this, we can close this issue related to quast, which is already working. Hopefully, I'll get this to work with my data eventually.

Thanks for the great support, a pleasure! Best.

fmalmeida commented 1 year ago

Hi @josruirod,

Great to hear that it is not a problem in the code itself and that you could properly execute everything with the quickstart dataset. So it seems that the solution should be rather simple once your IT guys figure out how to set up these environments.

Unfortunately, I cannot help you much with that, for two reasons: (1) HPC configurations tend to differ quite a lot between setups, and (2) I don't have much experience with them.

But I really hope that you manage to solve it and can use the pipeline with your dataset. I will keep the issue open until I merge the modifications we made both for this issue and for issue #36. Once I merge, I will close both.

Many thanks for the feedback, for reporting the issues, and for the kind words.

Best regards.

josruirod commented 1 year ago

Hi, everything you say makes sense; let's see if I can get around my HPC issues.

So I'll take the chance to ask you another question. It seems I'm very close, except for the process in the attached screenshot:

More than 1 day seems too long? I was monitoring most of the time, and CPU and memory usage was kind of low, always very far from the max I gave as input. I understand the reasoning that full power will only be used after a first round fails, but is there any way the user can control the initial resources? For example, I'm now planning on rerunning with -resume until my HPC behaves, and I would like to give unicycler and pilon more resources from the beginning. Does that make sense?

Thanks!

fmalmeida commented 1 year ago

Hi @josruirod,

Glad to see it is going. And hopefully it will succeed.

Indeed, what you said makes sense, and this is very straightforward with nextflow. By default, it has a list of priorities, and everything you set in a config file given with -c will overwrite my defaults, as you can see here.

That being said, you can see in this file (https://github.com/fmalmeida/MpGAP/blob/master/conf/base.config) how I am setting resource allocation. And even these resource selectors have priorities, as you can see here: https://www.nextflow.io/docs/latest/config.html#selector-priority

priority (higher to lower): withName > withLabel > inside module script > generic/general definitions

So, in order to adjust this, you can either change the resource allocation for the label process_assembly, which controls resources for all assemblers and polishers at once, or adjust only a few processes with withName.

For example, you could:

process {
    withName: strategy_2_pilon {
        cpus   = { params.max_cpus   }
        memory = { params.max_memory }
        time   = { params.max_time   }
    }
}

This should make this specific process allocate everything you set using the params.
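
And if you'd rather raise the resources for all assemblers and polishers at once, the same idea with the process_assembly label should work (a sketch along the same lines):

process {
    withLabel: process_assembly {
        cpus   = { params.max_cpus   }
        memory = { params.max_memory }
        time   = { params.max_time   }
    }
}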

You can read more about this in the nextflow manual: the configuration explanation and all available directives for processes.

josruirod commented 1 year ago

Thank you so much! Nextflow is indeed awesome. I'm looking forward to learning more and maybe even preparing some silly pipelines of my own. So I'll try, and hopefully I will get it to work. So close; it's frustrating to see the nice commands in the ".command.sh" files and the input files ready, and that it fails due to some issue with my HPC and singularity/nextflow. Last question: since the command is provided there, is it crazy to just execute that .sh "manually" (outside nextflow) to get the final results? I guess the pipeline's -resume won't detect them, but maybe I could make it work this way?

fmalmeida commented 1 year ago

Hi @josruirod, yes, you can try to execute the ".sh", or better, the ".run" scripts manually. The .run script launches the container image, and the .sh script runs directly on your machine (you must have the software installed).

But anyway, even if you are able to generate the results, the pipeline will most probably not be able to see them.

One question: how are things going? Did you manage to execute it? There are already tickets that I can close once I merge the params-issue branch, right? I don't remember.

😄

josruirod commented 1 year ago

Hi, I'm sorry it took me a while to get back; I did not get the notification. I'm afraid I'm still trying to figure it out with our dataset, but it's definitely due to issues with the HPC and nextflow/singularity. The test data worked, and almost all steps are done. So I'd say you can close the tickets, and that's all for now. If I keep struggling and need anything from you, I'll let you know and open new issues. Thanks for the great support, and sorry for the delay!