ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

cactus-align memory error? #749

Closed Overcraft90 closed 2 years ago

Overcraft90 commented 2 years ago

Hi @glennhickey,

I've started minigraph-CACTUS without major problems on my HPC cluster. However, at step 3 (cactus-align) the script terminates with an error and I'm unable to generate the .hal file. Attached is the .log file (cactus-align_error.log); this is just the last bit where the error arises, and if it is not enough I can share the entire .out file.

  1. Now, the first thing that comes to my mind is that this might be related to the fact I kept changing the jobstore folder at each step (e.g. jobstore1, 2, 3 etc.), although it doesn't seem very likely.

  2. The second thing I can think of is the resources allocated for the process; I have a single node with 48 threads and 3TB of RAM. Again, this looks to me like an unlikely cause, but I can give you some insight into the work I'm doing so that you might tell me for sure. I'm assembling a pangenome for 5 human individuals (10 haplotypes) + GRCh38 as reference; could this be the reason despite the large amount of memory allocated in addition to the 48 threads?

  3. The third option involves how up to date the architecture of the cluster system I'm using is. I've read an issue about a version of the tool not being compatible with very old CPU models. This is what the cluster is currently running (2x x86 Intel Xeon Platinum 8276-8276L CPUs, 24 cores, 2.4 GHz); however, the system recently received a major update, so I was wondering whether this could be the reason.

Please let me know. I might have missed something, as this is my first experiment with minigraph-CACTUS; if so, I'm happy to understand where the issue is. Thanks in advance!

Overcraft90 commented 2 years ago

Hi again,

Has anyone faced the same issue, or does anyone have any recommendations? Thanks.

gwct commented 2 years ago

Hi @Overcraft90, I have had similar errors while running cactus, though I'm not using minigraph but rather the GPU-accelerated version of progressive cactus. Regardless, it seems like these issues are common when running cactus. See #596, started by someone else, which I've been posting in since I encountered the same type of error regarding disk usage:

LOG-TO-MASTER: Job used more disk than requested. For CWL, consider increasing the outdirMin requirement, otherwise, consider increasing the disk requirement. Job files/for-job/kind-CactusConsolidated/instance-y811o9v3/cleanup/file-b679f7de38d3480ba17fd4f0817df921/stream used 495.90% disk (59.6 GiB [63981568000B] used, 12.0 GiB [12902153700B] requested).

Though I originally thought the issue had to do with the sub-command blossum5, I think the disk usage was the actual problem. In this post of that issue I pointed to some other older issues that seemed similar. While nothing I tried actually resolved this for me, I was lucky that updating to the new version 2.1.1 allowed me to complete my jobs. It looks like these types of things are pervasive, though, so that may not be a fix for all situations.

In response to your points:

  1. I don't think changing the jobStore folder will have any effect -- I always change jobStore for each step.

  2. Your resources seem more than sufficient for the job, but this appears to be a disk issue, not a memory one. You should make sure that your --workDir and your system's tmp/ dir both have large amounts of space you can write to; that has helped me resolve these types of things before, since my tmp/ didn't have a lot of space (see the quick check after this list). There is also the --defaultDisk option for many cactus commands I've used, but I haven't seen tweaking it really affect anything.

  3. I'm not sure about the role the system architecture plays in this problem or the running of cactus in general.
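If it's useful, a quick way to see how much writable space those locations actually have (the paths here are just placeholders for your own scratch and temp directories) is something like:

df -h /path/to/your/scratch /tmp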

I will also note that your log file contains

Exception caught: error parsing sequence CM000663.2

which isn't something I've seen before. I would double check that there is nothing wrong with that sequence just to be sure.

This is just my advice from several months of experience trying to get this to work, so hopefully it is somewhat helpful!

Overcraft90 commented 2 years ago

Hi @gwct,

Thanks a lot, that was a very insightful answer; you touched on all my doubts and clarified the situation for me. Thanks again!

I'm running the latest version 2.1.1, so I believe the reason for the issue is "not enough disk space". Although I'm working in /scratch (where there is a total of 1PB of storage), I already have some files there which I could delete for extra space.

In fact, the /scratch directory offers shared space for a total of 1PB, meaning that the actual space I have depends on what other users are doing in that partition... or at least I believe so.

I will try deleting some files and let you know, thanks.

gwct commented 2 years ago

Hey there, I can't be 100% sure, but to me 1PB definitely seems like enough space, even if it is shared. I think first I would try re-running with --workDir set to your scratch directory. From the help menu of cactus-align:

  --workDir WORKDIR     Absolute path to directory where temporary files generated during the Toil run should be placed. Standard output and error from batch system jobs (unless --noStdOutErr) will be placed in this directory. A cache directory may be placed in this directory.
                        Temp files and folders will be placed in a directory toil-<workflowID> within workDir. The workflowID is generated by Toil and will be reported in the workflow logs. Default is determined by the variables (TMPDIR, TEMP, TMP) via mkdtemp. This directory
                        needs to exist on all machines running jobs; if capturing standard output and error from batch system jobs is desired, it will generally need to be on a shared file system. When sharing a cache between containers on a host, this directory must be shared
                        between the containers.

So, that seems like it would also negate the use of the system /tmp/ dir, but I'm not sure. If that doesn't work I would try manually setting your temporary directory to be somewhere with a lot of space before you run cactus-align:

export TMPDIR=/path/to/your/scratch/folder/
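For cactus-align to actually pick this up, the variable needs to be exported (as above) or set inline on the same command line, e.g. TMPDIR=/path/to/your/scratch/folder/ cactus-align ... (the path being just a placeholder).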

Hopefully something like that will work, but again, I didn't have any luck with my issue, which was somehow just fixed in the newest version.

Overcraft90 commented 2 years ago

Hi again @gwct,

Sure thing. I will do as you said and let you know before deleting any files, which, as you might imagine, was more of a "last resort" option than my first preference.

glennhickey commented 2 years ago

Hi @Overcraft90. Sorry for the late reply. I think this is similar to https://github.com/ComparativeGenomicsToolkit/cactus/issues/745#issuecomment-1198115775 where somehow a fasta is getting wiped out by cactus-preprocess. Would you be able to double check that? I really need to figure out a way to reproduce this to get it fixed. I don't think it has anything to do with memory, btw.

Overcraft90 commented 2 years ago

Hi @glennhickey,

Thanks a lot for getting back to me. I can confirm, in fact, that testing with both the --workDir flag and even changing the $TEMPDIR didn't help at all.

Now, it would be interesting to understand why such a thing happens; I mean, why (and which) of the .fasta files I listed in the .txt file gets wiped out by the process. In the meantime, I can say I'm in contact with the IT team managing the HPC cluster where I work.

They have just told me this issue is unlikely to be related to a storage limitation. Instead, the problem could be related to the default behaviour of cactus, which tries to allocate a default disk space (--defaultDisk) of 200GB, but somehow this might create some conflict with the --pangenome flag.

Specifically, they proposed that the --pangenome flag could potentially override this value with a lower one of only 2GB. I hope this makes sense and can help somehow; I'm not exactly aware of what's happening, but I would like to put together pieces which might help address this issue.
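If that is really what's happening, one thing I could try (just an assumption on my part, since --defaultDisk was already mentioned above as an option the cactus commands accept) is passing it explicitly on the command line, e.g. adding --defaultDisk 200G to the cactus-align call, to see whether that overrides whatever --pangenome sets.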

glennhickey commented 2 years ago

Did you run cactus-preprocess? If you did, it should be easy to find the output fastas and check them. Even if you didn't, if you had an empty sequence in your original input, that could explain this error, so that could also be something to check.

Overcraft90 commented 2 years ago

Hi again @glennhickey and thanks for the speedy reply,

I see what you mean; I haven't run cactus-preprocess, but just looking at the .fasta files I can see that the header of the reference contains spaces, as in the following line:

>CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly

Also, compared to the other haplotypes, the reference is wrapped in lines of 81 bases instead of 61. I'm not sure if any of these things can lead to the aforementioned issue, but I thought it was worth mentioning them just in case.

Overcraft90 commented 2 years ago

Hello,

I'm going to run cactus-preprocess, but I'm just wondering what I should give it as input. Do I simply indicate the same information I passed to cactus-align? e.g.:

cactus-align /g100_scratch/userexternal/mungaro0/pangenomes/1.human-pg/jobstore3 /g100_scratch/userexternal/mungaro0/pangenomes/1.human-pg/human.pg.txt /g100_scratch/userexternal/mungaro0/pangenomes/1.human-pg/human.paf /g100_scratch/userexternal/mungaro0/pangenomes/1.human-pg/human.hal --pangenome --pafInput --outVG --reference GRCh38 --realTimeLogging

But instead of cactus-align I would have cactus-preprocess? Thanks in advance; I saw a similar post where someone did something close to this, plus some additional output files, so I was wondering whether it is correct.

glennhickey commented 2 years ago

I've seen a (hopefully extremely rare, but recent github issues have me second-guessing this) case where running cactus-preprocess would actually erase a sequence, leading to that halAppendSubtree error. If you didn't run cactus-preprocess to begin with, then that cannot be the problem, so there is no need to rerun it. I think what you need to do is check all your input fasta files for empty sequences. Something like

>chr1
>chr2
AACCTGT

would be sufficient to cause that error. I'll make sure that the next cactus release handles this better (either a clear error message right away, or fully supporting empty sequences).
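One rough way to look for such records (just a sketch; the filename is a placeholder and you would run it on each input fasta) is to print the header of every record whose sequence length is zero:

awk '/^>/{if (name && len == 0) print name; name = $1; len = 0; next} {len += length($0)} END{if (name && len == 0) print name}' input.fasta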

Overcraft90 commented 2 years ago

Hi @glennhickey,

I've been busy over the past few days moving over to SC, so I'm sorry to come back to you late. However, I checked with the command grep -c '^$' my_fasta_files.fasta whether there was an empty sequence in any of my input files. I couldn't find any missing sequence though... Is there anything else I can do, or maybe another command I should use? Thanks in advance!

glennhickey commented 2 years ago

Looking up the thread, I suppose it could be due to the spaces in the fasta headers. cactus-preprocess would fail on those.

So one thing to try would be removing the first space and everything after (assuming the first token is unique), then rerunning cactus-align.
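For example (just a sketch, with filenames as placeholders), something like sed '/^>/ s/ .*//' input.fasta > input_fixed.fasta would trim each header down to its first token.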

Overcraft90 commented 2 years ago

Hi @glennhickey,

I did change the header for the reference from: >CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly to simply: >CM000663.2

At first I got an error because I was just running the cactus-align step with this new _fixheader.fasta, which I realised was due to some problems with the different header entry. So, I re-ran all the steps from constructing the minigraph GFA onward but, unfortunately, I ran into the same error (or at least that's what I think).

If it helps I can post a .log file for the error, let me know. Thanks in advance!

glennhickey commented 2 years ago

How big is your data? If you can share it I could try to reproduce.

Overcraft90 commented 2 years ago

Hi @glennhickey,

I created a Dropbox folder and I shared it with you. Here is the link: https://www.dropbox.com/scl/fo/zgn4yhb9kkdrgnfyc6xmg/h?dl=0&rlkey=9cs1oaeuj07choicy752ins9w

I included all .fasta files and also the original .txt I used to run the cactus-preprocessing.

glennhickey commented 2 years ago

Your grch38 file GRCh38_p14_fixheader.fna.gz has sequence names like

>CM000672.2 Homo sapiens chromosome 10, GRCh38 reference primary assembly

As mentioned previously, the spaces will cause problems (the next version of cactus will handle them better).

Also, there are sequences like:

>KI270917.1 Homo sapiens chromosome 19 genomic contig, GRCh38 reference assembly alternate locus group ALT_REF_LOCI_23

which will not be handled correctly -- the pipeline currently assumes all reference sequences are separate contigs and will not align them together. This is also something that's relatively simple to fix, but isn't done yet, so you need to leave them out.

I highly recommend using a GRCh38 with simple names (like chr1) and no alt contigs. This is what's used in the Cactus readme https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md#hprc-graph-setup-and-name-munging

A direct link is https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC_PLUS/GRCh38/assemblies/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

If you use this fasta for the reference, you can use your other sequences for the other samples and still follow all the instructions on the website to build your graph.

Overcraft90 commented 2 years ago

Hi @glennhickey,

My apologies for wasting your time on this issue... I actually went back to my .fasta for GRCh38 and checked all the chromosome starts; indeed, all of them have empty sequences. So, I used the following to remove everything but the first token after the ">":

sed '/^>/ s/ .*//' GRCh38_p14.fna > GRCh38_p14_clean.fna

Now, I also realised that despite having selected a file from NCBI that shouldn't have included the mtDNA, that sequence is present. For this reason, I removed it and tried a run with this new reference genome. I will keep you updated, and if it works I guess we can close the issue.
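(In case it's useful to anyone else, one way to drop a single record like that, with the mitochondrial header name and the output filename as placeholders, is something like: awk '/^>/{skip = ($1 == ">mito_accession") ? 1 : 0} !skip' GRCh38_p14_clean.fna > GRCh38_p14_noMT.fna)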

Again, apologies for wasting your time on this.

glennhickey commented 2 years ago

No worries, the handling of these errors is very user unfriendly in the current release. The next release should be much more robust to fasta header name issues -- it should either work or give a clear error message immediately.

Overcraft90 commented 2 years ago

Hi @glennhickey,

Just wanted to say thank you again. It worked perfectly; right now I'm on to visualizing with sequence tube map and bandage. I think I will close this issue.