Seeking advice for splitting the run of a 6G diploid assembly

rsharris commented 2 years ago

What do you want to know?

What would be a good way to run RepeatMasker, on a Slurm cluster, for a 6G diploid primate assembly?

Helpful context

The only output I need is the .out file. I am running RepeatMasker because another pipeline (AmpliCoNE) requires it, and that file is all it wants from RepeatMasker.

Is there a particular genome assembly or organism your question is about? If possible, please provide a link to a publicly available assembly and/or a species name.

The assembly isn't yet publicly available. It is a diploid assembly of a male primate, with about 1,500 different sequences. Longest is ≈190Mbp, shortest is ≈10Kbp.

Have you installed RepBase RepeatMasker Edition for RepeatMasker?

I have installed RepeatMasker via conda, and as a test I have successfully run it on a small sample (200Kbp) of a different primate (gorilla). I am using the "-species primates" option.

I see in issue 2 the recommendation is to split jobs into 1 Mbp pieces. I'm concerned that repeat elements crossing a boundary could go undiscovered. But if I use overlapping pieces, (a) how much overlap should I use, and (b) how should I merge all the .out files into a single .out file? Presumably I'd need to check for overlapping repeat elements and merge them.
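For part (b), one step is mechanical regardless of how overlapping elements end up being resolved: translating chunk-relative coordinates back to the original sequence. Below is a minimal sketch of that step only. It assumes each chunk was named `<sequence>:<start>-<end>` (1-based, inclusive), which is a hypothetical naming convention and not something RepeatMasker produces, and that .out data lines are whitespace-separated with the query name, begin, end, and "(left)" in columns 5 to 8. The function name `shift_out_line` is made up for illustration. It does not merge elements found in more than one chunk, does not renumber the ID column, and does not preserve the column alignment of the original file.

```python
import re

# Hypothetical chunk naming convention: "<sequence>:<start>-<end>", 1-based inclusive.
CHUNK_NAME = re.compile(r"^(?P<seq>.+):(?P<start>\d+)-(?P<end>\d+)$")

def shift_out_line(line: str, seq_lengths: dict) -> str:
    """Rewrite one .out line from chunk coordinates to whole-sequence coordinates."""
    fields = line.split()
    # Header and blank lines do not start with an integer score; pass them through.
    if len(fields) < 15 or not fields[0].isdigit():
        return line.rstrip("\n")
    m = CHUNK_NAME.match(fields[4])
    if m is None:
        # Query was not a chunk (an unsplit sequence); nothing to shift.
        return line.rstrip("\n")
    offset = int(m["start"]) - 1
    end = int(fields[6]) + offset
    fields[4] = m["seq"]
    fields[5] = str(int(fields[5]) + offset)
    fields[6] = str(end)
    # "(left)" is the number of bases remaining after the hit in the full sequence.
    fields[7] = f"({seq_lengths[m['seq']] - end})"
    return " ".join(fields)

example = "1234 10.5 0.0 1.2 chr1:1000001-2000000 101 400 (999600) + AluY SINE/Alu 1 300 (11) 7"
print(shift_out_line(example, {"chr1": 5_000_000}))
```

With the example line, the hit at 101-400 in the chunk lands at 1000101-1000400 on `chr1`, with 3999600 bases left.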

rmhubley commented 2 years ago

Hi, sorry for the late response. Is this still an issue? On several clusters I use Nextflow to run RepeatMasker on 50MB non-overlapping batches, on nodes that have at least 12 cores, using the (-pa 12) option. For SLURM this would be "--ntasks=1 -N 1 --cpus-per-task=12". As you point out, I do run the risk of missing a repeat every 50MB (right on the boundary), but more likely it will miss only repeats that extend insignificantly (<50bp) on one side or the other. Rarely would you have one that spans the boundary with insignificant alignments on both sides, but it can happen. If your application requires that you do not miss these, then I would run the boundaries separately by hand to avoid the hassle of dealing with overlaps in your primary run. The Nextflow workflow also handles batching and rejoining the batches, and I am happy to share it with you.
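The batching described above can be sketched roughly as follows. This is only an illustration and not the Nextflow workflow mentioned in the comment: the function names (`plan_batches`, `boundary_windows`, `sbatch_args`), the flank size, and the wrapper script name `run_rm.sh` are all hypothetical. It cuts long sequences into non-overlapping pieces, packs pieces into batches of at most the batch size, lists small windows centered on each internal cut so the boundaries could be run separately, and builds the Slurm arguments for one 12-core task.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (sequence name, 0-based start, exclusive end)

def plan_batches(seq_lengths: Dict[str, int],
                 batch_size: int = 50_000_000) -> List[List[Interval]]:
    """Cut sequences into pieces of at most batch_size and pack them greedily."""
    batches: List[List[Interval]] = []
    current: List[Interval] = []
    current_bp = 0
    for name, length in seq_lengths.items():
        for start in range(0, length, batch_size):
            end = min(start + batch_size, length)
            # Start a new batch when this piece would overflow the current one.
            if current and current_bp + (end - start) > batch_size:
                batches.append(current)
                current, current_bp = [], 0
            current.append((name, start, end))
            current_bp += end - start
    if current:
        batches.append(current)
    return batches

def boundary_windows(seq_lengths: Dict[str, int],
                     batch_size: int = 50_000_000,
                     flank: int = 10_000) -> List[Interval]:
    """Windows centered on each internal cut point (flank size is an assumption)."""
    windows: List[Interval] = []
    for name, length in seq_lengths.items():
        for cut in range(batch_size, length, batch_size):
            windows.append((name, max(0, cut - flank), min(length, cut + flank)))
    return windows

def sbatch_args(cores: int = 12, script: str = "run_rm.sh") -> List[str]:
    """Slurm arguments for one task on one node; 'run_rm.sh' is a hypothetical wrapper."""
    return ["sbatch", "--ntasks=1", "-N", "1", f"--cpus-per-task={cores}", script]

# Toy lengths so the result is easy to check by eye.
toy = {"a": 120, "b": 30, "c": 10}
print(plan_batches(toy, batch_size=50))
print(boundary_windows(toy, batch_size=50, flank=5))
```

With the toy lengths, sequence `a` is cut at 50 and 100, its last 20 bases share a batch with `b`, and the two boundary windows are 45-55 and 95-105 on `a`. The wrapper script would be the place to call RepeatMasker with `-pa` set to match `--cpus-per-task`.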

rsharris commented 2 years ago

Howdy,

Well I'm not sure whether this is still an issue or not. I'd forgotten about it, and today went digging back through my notes to see what I was doing. ... digging ... probably I don't need this any more.

It looks like I needed RepeatMasker output for this tool: github.com/makovalab-psu/AmpliCoNE-tool. The end goal is to estimate copy numbers of certain ampliconic genes in new primate assemblies. AmpliCoNE estimates this from reads but is given some information about the corresponding "reference" genome (including a RepeatMasker "track").

In retrospect, that now seems like a roundabout way of getting at the answer we want. AmpliCoNE was designed more for the use case where you had, say, a reference human assembly and reads from other humans. I think we were only using it back in April as a verification to compare to results computed by some other methods (because: some previous publications used it). Moving forward, I think there are better ways to get this answer.

Anyway, thanks for your suggestions. It's good to know how I would run this. Hopefully I will recall this if and when the problem arises again. (Or if someone else in the lab does).

Thanks again.

Dfam-consortium / RepeatMasker

Seeking advice for splitting the run of a 6G diploid assembly #160