esolares / HapSolo

Reduction of Althaps and Duplicate Contigs for Improved Hi-C Scaffolding
GNU General Public License v2.0

Indefinite or halted run and other problems #3

Closed sivico26 closed 3 years ago

sivico26 commented 4 years ago

Hi again @esolares,

After preparing both the BLAT and BUSCO inputs, I ran HapSolo on my genome. Here is the command I used:

/home/svillanu/Programs/HapSolo/hapsolo.py -i $genome -p inputs/all_blat.psl -b inputs/busco -t $threads -n 10000 -B 5

The first abnormality was that the program ran much longer than expected (the manuscript talks about minutes or hours; I am talking about days here). At first I thought it was because I had increased the number of iterations, but then other issues arose as well.

I am used to monitoring new programs to see how they behave. I noticed HapSolo is quite memory-greedy: it consumed all the memory of my big-memory node, went to swap, and filled the swap as well (378 GB RAM + 64 GB swap). At that point this error was written to stderr: "OSError: [Errno 12] Cannot allocate memory".

I thought it would die after that. Surprisingly, it did not: it freed around 5% of the RAM and continued to run (using 95 or 96% of RAM) almost until the end. I think the memory overflow occurred somewhere between iterations 1000 and 2000; I am not sure exactly, but it was early in the run.

I say almost because in the final days the process became a zombie, just a few iterations before completion, and the load of the node has been literally 0 since then. I am not sure exactly when that happened; it was at some point between the 4th and 6th day. Here are the final lines of the stdout log:

Thread: 6 Iteration: 9864 PID: 0.697840052589 QPctMin: 0.749624408845 QRPctMin: 0.748219642539 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9865 PID: 0.981652523466 QPctMin: 0.6996582334 QRPctMin: 0.988713060493 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9866 PID: 0.622771059337 QPctMin: 0.789602134318 QRPctMin: 0.609601244955 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9867 PID: 0.915375095066 QPctMin: 0.723506787404 QRPctMin: 0.639067875024 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9868 PID: 0.856326423758 QPctMin: 0.928365616211 QRPctMin: 0.778624561725 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9869 PID: 0.886717123028 QPctMin: 0.909846779352 QRPctMin: 0.828215474362 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9870 PID: 0.893778663033 QPctMin: 0.688867025539 QRPctMin: 0.880967651375 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9871 PID: 0.926879017717 QPctMin: 0.998530160183 QRPctMin: 0.626500697976 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9872 PID: 0.986239101466 QPctMin: 0.960922653281 QRPctMin: 0.67422769633 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9873 PID: 0.700156886313 QPctMin: 0.922191325029 QRPctMin: 0.725992714159 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9874 PID: 0.668620612795 QPctMin: 0.899669943682 QRPctMin: 0.647703942828 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9875 PID: 0.740902201141 QPctMin: 0.624417572385 QRPctMin: 0.996220080962 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9876 PID: 0.648385585591 QPctMin: 0.978341061174 QRPctMin: 0.637322594048 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9877 PID: 0.757874261042 QPctMin: 0.937099066762 QRPctMin: 0.91502840822 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9878 PID: 0.962106020897 QPctMin: 0.633275512907 QRPctMin: 0.875544395539 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9879 PID: 0.856038590899 QPctMin: 0.805744109036 QRPctMin: 0.706377312877 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9880 PID: 0.777671232127 QPctMin: 0.764298161442 QRPctMin: 0.685733338947 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9881 PID: 0.688367428802 QPctMin: 0.732459173159 QRPctMin: 0.951852832305 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9882 PID: 0.61227728141 QPctMin: 0.733645014281 QRPctMin: 0.736811699584 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9883 PID: 0.776652941736 QPctMin: 0.998175463159 QRPctMin: 0.690118740731 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9884 PID: 0.998448678075 QPctMin: 0.737478121682 QRPctMin: 0.956603024547 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9885 PID: 0.919531910188 QPctMin: 0.638240492371 QRPctMin: 0.95684031795 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9886 PID: 0.801795332192 QPctMin: 0.828104307473 QRPctMin: 0.93149032277 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iteration: 9887 PID: 0.815746567045 QPctMin: 0.867245035556 QRPctMin: 0.796855990086 CostFxnDelta: 0.0 ASMScoreFxn: 5000.0
Thread: 6 Iterat

The truncated last line is weird, but that is how it appears in the log; I am not sure why. It does not seem like a Python thing.

The last thing I found weird is the "CostFxnDelta: 0.0". If I understand correctly, HapSolo runs an optimization, and I interpret this variable as the change in the cost function between iterations. It is strange that it never changes, even though the other values vary with the iterations. Just to be clear, CostFxnDelta has been 0.0 since the beginning (literally since iteration 0) and apparently never became anything different.

After all this, I thought I had made a mistake with the inputs. But the BLAT psl looks fine, and as I understand it, the BUSCO input is the parent directory where the busco### directories are located; is that correct?

The last question I have is: what are the threads doing? From the log it looks like every thread runs N iterations on its own, which in my case adds up to around 240,000 iterations and sounds like overkill. What is the sensible way to use them?

It is worth mentioning that, judging by the logs, both the BUSCO and BLAT inputs seem to be okay.

I really appreciate your thoughts on all this mess.

esolares commented 4 years ago

Hi,

It sounds like some of the threads may have crashed and Python is waiting for them to finish, but cannot because of the out-of-memory condition. By the way, minimap2 PAF files seem to take up less RAM when running HapSolo. One thing to check is how many lines are in your psl file and how large the file is; can you run ls -lh *.psl? Also, how many contigs are in your contig assembly, and what is its current N50? It sounds like HapSolo is being overwhelmed by the number of alignment hits it is getting. Did you also apply a minimum length filter?
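
For reference, something along these lines would give the numbers I am asking about (a quick sketch; contigs.fasta and all_blat.psl are placeholder paths):

# Contig count, N50, and PSL line count for the sanity checks above.
def fasta_lengths(path):
    lengths, size = [], 0
    with open(path) as fh:
        for line in fh:
            if line.startswith('>'):
                if size:
                    lengths.append(size)
                size = 0
            else:
                size += len(line.strip())
    if size:
        lengths.append(size)
    return lengths

def n50(lengths):
    half, running = sum(lengths) / 2.0, 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

lengths = fasta_lengths('contigs.fasta')
print('contigs: %d  N50: %d' % (len(lengths), n50(lengths)))
print('psl lines (including headers): %d' % sum(1 for _ in open('all_blat.psl')))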

With respect to the number of iterations, -n is currently the number of iterations per thread. I could add an option, say -N, that takes the total number of iterations, divides it by the number of threads, and takes the ceiling. I understand now that -n * -t might cause confusion; it would probably be best to just take a total number of iterations and divide it by the number of threads provided.

Thank you,

Edwin

esolares commented 4 years ago

With respect to CostFxnDelta, it is the difference between the current cost function value and the previous one. It is used to check whether the search is on a plateau, in which case HapSolo randomizes the alignment parameters and restarts its search from that new randomized set.
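
Schematically, that step behaves something like this (just a sketch of the idea, not the actual HapSolo code; the cost function below is a stand-in):

import random

def random_params():
    # PID, QPctMin, QRPctMin drawn at random, like the values in the log above
    return {k: random.random() for k in ('PID', 'QPctMin', 'QRPctMin')}

def cost_function(params):
    # stand-in score; the real one comes from BUSCO counts after purging
    return round(sum(params.values()), 2)

params = random_params()
prev_cost = cost_function(params)
for iteration in range(1000):
    # take a small hill-climbing step and score it
    params = {k: min(1.0, v + 0.01) for k, v in params.items()}
    cost = cost_function(params)
    delta = cost - prev_cost      # this is the CostFxnDelta printed in the log
    if delta == 0.0:
        # plateau: randomize the alignment parameters and restart the search there
        params = random_params()
    prev_cost = cost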

sivico26 commented 4 years ago

Hi there, here are some summary statistics of my assembly generated with abyss-fac:

n       n:500    n:1000   n:2000   L50     min     N75     N50     N25     E-size  max     sum     name
13713   12358    11490    10722   1428    247    55625   112275  213463  168686  1627529 594.3e6 masurca.fasta

I did not filter the assembly. Although it is quite fragmented, I do not think the overload is caused by an excessive number of small sequences. Instead, I think the issue might be its high repeat content (~47%) and the homoeologs present (it is an allotetraploid genome with a mainly AABB heterozygous structure). What do you think?

The number of psl files is 13714, as expected (13713 sequences + 1 file that merges them all).

Regarding the iterations per thread, should I then reduce the iterations to, say, 100 to get 2400 in total if I am using 24 threads?

So is it normal that CostFxnDelta is always 0? According to your explanation, that would mean it randomized the parameters at every iteration; is that correct?

If the alignments are the problem, I might try minimap2 to see if it makes a difference.

sivico26 commented 4 years ago

Okay, so I ran HapSolo using a .paf input. For the record, I used the second minimap2 line to make the alignment:

minimap2 -t 36 -P -k19 -w2 -A1 -B2 -O1,6 -E2,1 -s200 -z200 -N50 --min-occ-floor=100 ${QRY} ${QRY} > $(basename ${QRY} .fasta)_self_align.paf

The thing is, I got the following error:

Traceback (most recent call last):
  File "/home/svillanu/Programs/HapSolo/hapsolo.py", line 642, in <module>
    mypddf = CreateMM2AlignmentDataStructure(pafalignmentfile)
  File "/home/svillanu/Programs/HapSolo/hapsolo.py", line 485, in CreateMM2AlignmentDataStructure
    'chainingscore', 'secondchainingscore', 'approxdivergence', 'lqrhrepseeds'])
  File "/home/svillanu/.conda/envs/Hapsolo/lib/python2.7/site-packages/pandas/core/frame.py", line 369, in __init__
    arrays, columns = _to_arrays(data, columns, dtype=dtype)
  File "/home/svillanu/.conda/envs/Hapsolo/lib/python2.7/site-packages/pandas/core/frame.py", line 6284, in _to_arrays
    dtype=dtype)
  File "/home/svillanu/.conda/envs/Hapsolo/lib/python2.7/site-packages/pandas/core/frame.py", line 6363, in _list_to_arrays
    coerce_float=coerce_float)
  File "/home/svillanu/.conda/envs/Hapsolo/lib/python2.7/site-packages/pandas/core/frame.py", line 6420, in _convert_object_array
    'columns' % (len(columns), len(content)))
AssertionError: 18 columns passed, passed data had 17 columns

Digging into it, I found that the produced .paf file has 17 columns, while the CreateMM2AlignmentDataStructure function expects 18, which explains the error.

Then I wondered where the discrepancy comes from and how to solve it, so I checked both the PAF specification and the columns CreateMM2AlignmentDataStructure expects. Apparently the -P flag in the minimap2 command is the root of this; per the documentation:

-P: Retain all chains and don’t attempt to set primary chains. Options -p and -N have no effect when this option is in use.

What I think is happening is that, since minimap2 is not setting primary chains, the distinction between s1 and s2 stops making sense and minimap2 does not report s2 at all. At least that is what happened when I ran minimap2 with the line above on my genome: there is no s2: column, so the file has 17 columns and CreateMM2AlignmentDataStructure rejects the .paf file.
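
For what it is worth, this is roughly how I checked the column counts and the s2: tag (a quick sketch; the .paf path is a placeholder):

from collections import Counter

column_counts = Counter()
lines_with_s2 = 0
with open('masurca_self_align.paf') as paf:
    for line in paf:
        fields = line.rstrip('\n').split('\t')
        column_counts[len(fields)] += 1
        if any(f.startswith('s2:') for f in fields):
            lines_with_s2 += 1

print('column counts: %s' % dict(column_counts))
print('lines with an s2: tag: %d' % lines_with_s2)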

Did minimap2 produce 18 columns when you ran it with -P? I imagine it did, since that line is in the README, but I do not see why this would vary from genome to genome.

What do you think? Am I missing something?

PS: I am currently running HapSolo with the .paf produced by the first recommended minimap2 line, since that one produced 18 columns.

esolares commented 4 years ago

I see. I will look into it then.

sivico26 commented 4 years ago

I am starting to feel that I am being annoying with these issues, sorry about that. Continuing from the previous step:

I got this error when running Hapsolo with the .paf file with 18 columns:

Traceback (most recent call last):
  File "/home/svillanu/Programs/HapSolo/hapsolo.py", line 691, in <module>
    mybestnscoreslist[i][4][1]) + '_primary.fasta'
  File "/home/svillanu/Programs/HapSolo/hapsolo.py", line 412, in CalculateInverseProportion
    inversePct = exp(-1.0 * log(myPct, 2))
ValueError: math domain error

Apparently the error is reproducible: I ran it twice, and on both occasions it emerged at iteration 31. I tried to follow the code to see why the function passed a 0 into that log, but to be honest I did not succeed.
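
To illustrate, the expression from that line of the traceback raises exactly this error whenever it receives 0 (or a negative value); this is only a minimal reproduction, not HapSolo's actual code path:

from math import exp, log

def inverse_proportion(myPct):
    # same expression as in the traceback above
    return exp(-1.0 * log(myPct, 2))

print(inverse_proportion(0.8))   # works fine
try:
    inverse_proportion(0.0)      # log(0) is undefined
except ValueError as err:
    print(err)                   # "math domain error"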

Thanks for your help Edwin.

esolares commented 4 years ago

Hi,

No worries you are not annoying at all. I am happy to help and apologize for the late reply. I have had to take our doggie to the vet for a mass growth in his gums and today took the little one to archery practice outdoors.

With respect to the large amount of RAM being used, it does not appear to be due to the size of the PSL file, but perhaps I could take a look at it if you compress it into a gz file.

It also appears you have found a bug where HapSolo is generating a value of 0.0. I will look into this and correct it in the code. Also, HapSolo does not always report a CostFxnDelta of 0; that only happens when it hits a plateau, and when it is not on a plateau the delta moves in the positive direction. With respect to the number of iterations, you could run 100 just to test the run, but I would recommend a total of 50,000 iterations; if you are running 24 cores, then about 2,000 iterations per thread would be good for an actual run. I will perform some test cases, make the changes, and push them to GitHub. Please give me a few hours; I will let you know when I have fixed the error.

esolares commented 4 years ago

Hi,

Please try running HapSolo once more. I have gone ahead and updated both branches to no longer pass values of 0.0 for PID, Q and QR during the initial hill climbing step. I have also fixed the error where 18 columns are always expected from the PAF file. Please let me know if you are still having issues.

Thank you for letting me know about these bugs. Your help and feedback are absolutely appreciated! 👍

Thank you again,

Edwin

esolares commented 4 years ago

Hi,

I hope all is well. I was wondering if you were still having the same issues.

Please let me know.

Thank you,

Edwin

sivico26 commented 4 years ago

Hi Edwin, thanks for your reply. These have been busy days, so I have not had enough time for this particular task.

As a preliminary report, I changed the minimap2 line to the third one recommended and the job is currently running. I hope to have news tomorrow.

Best

esolares commented 4 years ago

Hi,

No worries. Either line should work now, as HapSolo now accepts both column counts generated by minimap2. Looking forward to the news, good or bad. It is good that you caught these bugs, as it allows me to fix them and ultimately have a better software package.

Thank you again for your help and feedback.

Edwin

sivico26 commented 4 years ago

Okay so here is what I got:

When I use the first minimap2 line to generate the .paf file:

minimap2 -t 36 -k19 -w5 -A1 -B2 -O3,13 -E2,1 -s200 -z200 -N50 --min-occ-floor=100 ${QRY} ${QRY} > $(basename ${QRY} .fasta)_self_align.paf

The resulting file is really small. When I run HapSolo on it I still get ValueError: math domain error as before (seemingly at the same place, by the way). I do not know whether this is worth worrying about, as the file seems to be wrong in the first place. I attach it in case you find it useful.

masurca_overlap2.paf.gz

The other two suggested minimap2 lines produce the expected .paf files (~14 GB uncompressed, 3.1 GB gzipped), which are too large for me to share. When I run HapSolo with either of these, I am back to the original topic of this issue: the process runs out of memory and eventually ends up as a zombie.

Just out of curiosity, what do you think is driving such high memory consumption?

I will try to filter out some small sequences and run it again to see if I can avoid the memory overflow.
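
For reference, this is more or less the length filter I have in mind (a minimal sketch; the file names and the 2000 bp cutoff are just the values discussed here):

def read_fasta(path):
    name, chunks = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith('>'):
                if name is not None:
                    yield name, ''.join(chunks)
                name, chunks = line.rstrip('\n'), []
            else:
                chunks.append(line.strip())
    if name is not None:
        yield name, ''.join(chunks)

with open('masurca.min2000.fasta', 'w') as out:
    for name, seq in read_fasta('masurca.fasta'):
        if len(seq) >= 2000:   # keep only sequences of at least 2000 bp
            out.write('%s\n%s\n' % (name, seq))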

sivico26 commented 4 years ago

I wanted to add that after filtering out sequences shorter than 2000 bp (which removed around 3000 sequences), the size of the .paf is basically the same (14 GB uncompressed, 3.1 GB gzipped). So it is not a surprise that the problem persists with the filtered input: HapSolo runs out of memory and the process ends up as a zombie.

esolares commented 4 years ago

Hi,

Would it be possible to upload your input files here? I would like to reproduce your run and see where these errors are originating from. I don't see why you would still get a math domain error, as I have eliminated the possibility of passing the log function a value of 0...

https://drive.google.com/drive/folders/1j_k31X-UqIDXgR_UwvX8IyOjUyhkrK_y?usp=sharing

esolares commented 4 years ago

If possible, please change the share settings to only you and I after accessing the link.

sivico26 commented 4 years ago

I can share the .paf and BUSCO files, but since this is unpublished data I'm afraid I cannot share the assembly itself. Regarding the BUSCO files: the folder is huge (more than 200 GB when I checked yesterday), so which files does HapSolo use? Looking at the code, I saw that the full_table .tsv is needed, but I want to be sure what else is required before filtering and packaging.

esolares commented 4 years ago

Hi,

You can run tar czvf buscos.tar.gz contigs/busco/*/*/full_table*.tsv and upload that. You can also gzip the paf file. With respect to the assembly, you could just send the contig names and lengths by running bioawk -c fastx '{print $name"\t"length($seq)}' $myasm > $myasm.lengths; I can then generate a random set of sequences to mimic the size of your assembly.
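
Something along these lines is what I mean by mimicking the assembly from the lengths file (a sketch; masurca.fasta.lengths stands for whatever the bioawk line above produces):

import random

with open('masurca.fasta.lengths') as lengths, open('mimic.fasta', 'w') as out:
    for line in lengths:
        name, length = line.split()
        # random A/C/G/T sequence of the same length as the real contig
        seq = ''.join(random.choice('ACGT') for _ in range(int(length)))
        out.write('>%s\n%s\n' % (name, seq))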

Are you able to change the share settings so that only you and I can see the shared folder?

Thank you,

Edwin

sivico26 commented 4 years ago

Apparently, when changing the permissions I locked myself out before uploading the files, hahaha. I have already prepared the files and requested access again.

Thanks Edwin

sivico26 commented 4 years ago

All right, files uploaded.

The small .paf is the one that produces the math error, while the large one produces the zombie state.

esolares commented 4 years ago

Hi,

Thank you for uploading the files so I can further investigate what is going on. I am currently running HapSolo on the smaller paf file; so far it is using 1 GB of RAM and all threads are at about iteration 80. I have also noticed that the scores are all at 50,000,000.0, which means it is not finding any contigs with a single complete BUSCO in them after purging. That default value is assigned when the single-copy BUSCO count is 0, since the cost function has single-copy BUSCOs in the denominator; this avoids a divide-by-zero error. I have yet to get the math error, though. I will continue running it, and if I still do not get the error I will run it again. I did a fresh pull from the minimap2 branch to make sure all new updates are incorporated.
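
Schematically, that check looks something like this (only the shape matters here, not the exact weights of the cost function: single-copy BUSCOs sit in the denominator, so a count of zero falls back to the large default instead of dividing by zero):

DEFAULT_SCORE = 50000000.0

def asm_score(single, duplicated, fragmented, missing):
    # illustrative only, not the exact HapSolo cost function
    if single == 0:
        return DEFAULT_SCORE      # no single-copy BUSCOs left after purging
    return (duplicated + fragmented + missing) / float(single)

print(asm_score(0, 10, 5, 3))     # -> 50000000.0, which is what I am seeing
print(asm_score(1500, 10, 5, 3))  # a small, comparable score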

I did some digging and found that although your assembly has 13,713 contigs, only 1,661 of them contain complete BUSCOs. This is most likely the problem, as most contigs (>87.8%) do not contain a complete BUSCO. Is it possible some of the BUSCO runs crashed? Could there be some other issue with your assembly? Have you already polished it? My suspicion is that either a majority of the BUSCO runs failed or there was an issue with the parameters of the BUSCO runs.
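
This is roughly how I obtained that count from the uploaded tables (a schematic, assuming the BUSCO 3 full_table layout with the status in column 2 and the sequence name in column 3):

import glob

contigs_with_busco = set()
for path in glob.glob('contigs/busco/*/*/full_table*.tsv'):
    with open(path) as table:
        for line in table:
            if line.startswith('#'):
                continue
            fields = line.rstrip('\n').split('\t')
            # Complete and Duplicated entries both carry the sequence name
            if len(fields) > 2 and fields[1] in ('Complete', 'Duplicated'):
                contigs_with_busco.add(fields[2])

print('%d contigs contain at least one complete BUSCO' % len(contigs_with_busco))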

Please let me know your thoughts and also if you find anything.

Thank you,

Edwin

esolares commented 4 years ago

Hi,

I have identified another bug, where the contig name was being pulled from the name of the BUSCO full table file. When a naming convention different from the one recommended in the bash script I provided is used, HapSolo assigns the wrong contig name in the data structure. I believe I have patched that bug and will be testing it now. However, I am still getting no single-copy BUSCOs when I run HapSolo with the updated code. I will investigate further and update you accordingly. Thank you for helping me catch another bug; hopefully I will be able to solve all the issues with your run of HapSolo.

Will keep you posted.

Edwin

sivico26 commented 4 years ago

Hi Edwin,

Thanks for how quickly you are looking into this case; very interesting findings. I was also puzzled by the way the contig names were loaded from the BUSCO files and was about to ask about it; not sure why I did not mention it.

Regarding the BUSCOs of the assembly, the global value I get is 96.4%, or 1937 complete, with the eudicotiledons_odb10 dataset. I will take the individual reports, check whether I get the same result, and get back to you.

esolares commented 4 years ago

Hi,

Yes, I assumed everyone would use my preprocessing scripts, but even so, I'm sure this bug would have come up eventually, as things can always change. This time I extract the contig name directly from the run command on the 3rd line of every full table file, and I have fixed the bugs that arose from that change. I am running it again now with 10 cores and 5000 iterations, and I am now getting scores. It does appear, however, that the score isn't decreasing much so far, but then again it is only on its 30th iteration. I'll continue running it and let you know.
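
Schematically, the contig name is now recovered like this (a sketch, assuming the standard BUSCO 3 header where the third line records the run command with its -i argument):

import os

def contig_from_full_table(path):
    with open(path) as table:
        for line_number, line in enumerate(table, start=1):
            if line_number == 3 and ' -i ' in line:
                tokens = line.split()
                fasta = tokens[tokens.index('-i') + 1]
                # e.g. ".../scaffold_123.fasta" -> "scaffold_123"
                return os.path.splitext(os.path.basename(fasta))[0]
            if line_number > 3:
                break
    return None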

With respect to the massive amount of memory being used on the larger file, I have some ideas on how to reduce memory usage. I will tackle that after I can get a successful run from the smaller file.

I will keep you posted. Also, please let me know what you find with regard to the number of contigs that contain BUSCOs. In the end it should still be fine, because HapSolo only uses contigs that contain BUSCOs to train on and find the optimal purging parameters, which are then applied assembly-wide.

Thank you,

Edwin

sivico26 commented 4 years ago

Hi Edwin,

Really nice to see the workarounds. Out of curiosity, couldn't the contig names be extracted from the assembly itself instead of the BUSCO files? Or do you have a reason to take them from the BUSCO files? Is it more robust?

Regarding the global BUSCO of the assembly, this is what I found.

In extended BUSCO categories (defined by me, haha):

Complete single: 1537 (72.47)
Duplicated: 429 (20.23)
Fragmented: 26 (1.23)
Comp_fragmented: 44 (2.07)
Dup_fragmented: 53 (2.5)
Mul_fragmented: 4 (0.19)
Missing: 28 (1.32)

I should mention that Comp_fragmented means the gene is found complete in one contig and fragmented at least once in another. Similarly, Dup_fragmented means it is found duplicated in some contigs and fragmented in others. Finally, Mul_fragmented means the gene was found fragmented multiple times. A rough sketch of how I derived these categories is below.
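
Essentially, I collect each BUSCO id's statuses across all the per-contig full_table files and classify the combination (a rough sketch of my ad hoc script; the glob path is a placeholder):

import glob
from collections import defaultdict

statuses = defaultdict(list)
for path in glob.glob('busco/*/full_table*.tsv'):
    with open(path) as table:
        for line in table:
            if line.startswith('#'):
                continue
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 2:
                continue
            busco_id, status = fields[0], fields[1]
            if status != 'Missing':
                statuses[busco_id].append(status)

def classify(found):
    complete = found.count('Complete') + found.count('Duplicated')
    fragmented = found.count('Fragmented')
    if complete == 1 and fragmented == 0: return 'Complete single'
    if complete > 1 and fragmented == 0:  return 'Duplicated'
    if complete == 1 and fragmented > 0:  return 'Comp_fragmented'
    if complete > 1 and fragmented > 0:   return 'Dup_fragmented'
    if fragmented == 1:                   return 'Fragmented'
    if fragmented > 1:                    return 'Mul_fragmented'
    return 'Missing'

counts = defaultdict(int)
for found in statuses.values():
    counts[classify(found)] += 1
print(dict(counts))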

In traditional Busco categories:

Complete single: 1581 (74.54)
Duplicated: 482 (22.73)
Total complete: 2063 (97.27)
Fragmented: 30 (1.41)
Missing: 28 (1.32)

The results are actually better than I expected in terms of completeness, although the duplicates skyrocketed. I did not expect such a difference between the global BUSCO run and the sequence-by-sequence one. In any case, I have to conclude that the BUSCO files you are working with are fine.

The best explanation I have for your observation (most contigs with no BUSCOs) is a biological one: regions with coding genes tend to be easier to assemble, so they end up in longer, and therefore fewer, contigs. I think that also means most of my contigs broke in repetitive regions with no coding genes nearby.

I hope it helps.

esolares commented 4 years ago

Hi,

I have the results for the smaller alignment file.

Writing masurca_1000_0.3082_0.8118to1.3509_0.8784_primary.fasta with score: 0.315059861374
Writing masurca_1000_0.2669_0.2799to6.2778_0.8118_primary.fasta with score: 0.315059861374
Writing masurca_1000_0.7887_0.6800to1.7444_0.8692_primary.fasta with score: 0.315059861374
Writing masurca_1000_0.8060_0.9600to1.0607_0.4431_primary.fasta with score: 0.315059861374
Writing masurca_1000_0.5949_0.4370to3.3012_0.2388_primary.fasta with score: 0.315059861374

What is concerning is the wide range of parameters. This is probably due to the lack of BUSCO hits in over 87% of the contigs. As I suspected earlier, there does not seem to be enough data for the model to train on and properly classify althaps...

I will look into implementing optimizations to increase speed, since it took 43279 seconds, or about 12 hours. RAM usage was actually pretty good, in the MB range: Maximum resident set size (kbytes): 112968.

I will try the larger file and check its memory usage.

esolares commented 4 years ago

Hi again,

I was able to successfully load the larger paf file. I have gone ahead and made some major RAM-efficiency changes, with a more than 50% reduction in RAM usage. Initially it was approaching 120 GB of RAM, which is the maximum available on my in-house test server; I currently have RAM usage down to 62.5 GB. Still hefty, but I hope to reduce it even further with more updates. I will update the GitHub mm2 branch soon and keep you posted.

The minimum score was again 0.315059861374. I believe this is again due to the lack of BUSCOs in >87% of the contigs. With the mosquito example from the paper, ~50% of the contigs contained BUSCOs. I will, however, have to update HapSolo to support the changes in the latest BUSCO software; it currently supports the older version of BUSCO, as stated in the paper.

The best result so far has been a removal of 4,813,394 nt's from the assembly.

Thank you,

Edwin

sivico26 commented 4 years ago

Hi Edwin, thanks for keeping me posted.

Really nice results indeed, especially the RAM savings; I think it makes the program more usable for future users. Regarding the results so far, this is a difficult allotetraploid genome with a really high level of heterozygosity (11.67%) and a considerable proportion of repetitive sequence (47%). It would not surprise me if BUSCO alone is not enough to properly separate the haplotypes.

Out of curiosity, I was playing with BUSCO and the difference between the global and sequence-by-sequence results. I can reproduce it in the latest stable release (4.1.1), at least for my dataset. At some point this week I will probably report it to the BUSCO team to see what they think; I can keep you posted on that if you are interested.

Regards

esolares commented 4 years ago

Hi,

Sorry for the late reply. I have updated the Minimap2 branch to use less RAM. Also, when I used 48 cores, I was able to run HapSolo in 2 hours and 13 minutes. Were there any issues with BUSCO not completing properly on other contigs?

I have checked all my other datasets and I have at least 25% of the contigs containing at least one complete BUSCO. Perhaps 15% is too low for HapSolo to be able to estimate the correct alignment parameters for purging. This result is very interesting.

Please let me know if you are able to increase the proportion of contigs with complete BUSCOs. Also, could you try running HapSolo again on your cluster? Feel free to use as many cores as possible, making sure you have enough RAM allocated for your run; usually that is 4 GB/core.

Thank you,

Edwin

sivico26 commented 4 years ago

Hi Edwin,

What you tell me about the BUSCOs of the other datasets is interesting; I will let that sink in. As I said before, summing the results over all the contigs gives even better numbers than the BUSCO run on the whole assembly. I double-checked the logs and folders again and did not see anything indicating a problem, so I think this characteristic is indeed biological, or maybe the assembly is simply too fragmented.

I will run HapSolo again on my cluster and let you know how it goes.

Regards,

sivico26 commented 4 years ago

Hi Edwin,

I already had the chance to rerun HapSolo with the current changes. First of all, I want to congratulate you, because the memory-footprint reduction is really significant: memory consumption was 36 GB most of the time, and I think it peaked at 52 GB before finishing. Either way, it is very different from the hundreds of GB I saw in previous runs.

As I was writing this reply I noticed I set up only a few iterations (100 x 24 threads = 2400) and not the 50,000 you recommended. I will run the longer one and come back to you when it finishes. Just to let you know, this one took 3 h to complete.

Regards

sivico26 commented 4 years ago

Hi Edwin,

I reran with a higher number of iterations (2000 x 24 = 48000) and the process took 7 h and 41 minutes to complete. The results, though, are somewhat disappointing:

n   n:500   L50 min N75 N50 N25 E-size  max sum name
13713   12358   1436    500 55224   111867  212939  168206  1627529 596e6   masurca.fasta
13452   12097   1430    500 55568   112108  213386  168565  1627529 594.7e6 asms/masurca_1000_0.9440_0.8950to1.1736_0.4149_primary.fasta
261 261 49  1002    4122    8436    13577   11025   46445   1357609 asms/masurca_1000_0.9440_0.8950to1.1736_0.4149_secondary.fasta

Basically, no sequences were purged. The curious thing is that I got better results in the previous run when I used only 100 iterations: HapSolo removed around 15 or 16 Mb as you told me previously. Any idea why?

Anyway, maybe you are right and there are not enough scaffolds to properly train the algorithm. Or maybe it is something else; I am starting to think that this assembly is not as redundant, or does not contain as many haplotypes, as I thought. But that would contradict some solid results I got from genome profiling, so I am still trying to think this through.

esolares commented 4 years ago

Hi,

That is very disappointing to hear. Do you have the scores for the first and second runs? They should be in the stdout files from your cluster output. I honestly think there is just not enough data to train from; perhaps tweaking the BUSCO settings would help. What settings are you using for the odb9 lineage and for the species? Is there a closer species to use when Augustus does its HMM training? If we can get more contigs with BUSCOs in them, it should provide more data to train with. One thing I noticed is that with each step the delta scores were almost always 0; I even saw huge variation in PID, Q and QR values, all with the same score. This was due to the very low number of BUSCOs in the contigs/scaffolds. I'll dig more into this.

Also, have you finished polishing the assembly? I usually recommend 2x Arrow and 1x Pilon. An extra run of Pilon doesn't hurt, but the benefits of a second Pilon run are usually slim to none.

esolares commented 4 years ago

FYI. I have had massive improvements in BUSCO identification after polishing. I recommend two runs of Arrow or Racon and then a run or two of Pilon.

esolares commented 4 years ago

Also please make sure you are using the latest version of BUSCO 3 as BUSCO 4 is still not supported. Also instead of -sp arabidopsis have you tried using -sp tomato? For our chardonnay genome we also used arabidopsis. Here are the options I used.

MYLIB="embryophyta_odb9"
SPTAG="arabidopsis"
OPTIONS="-l ${MYLIBDIR}${MYLIB} -sp ${SPTAG}"

and then your BUSCO run with ${OPTIONS}. This should all be included in the GitHub repo as well. It does look like you used the script I provided for running BUSCO on each contig, but I'm not 100% sure.

I would first try to maximize your BUSCO results on your polished assembly (having a polished assembly is very important) and then try running HapSolo. You don't need to run BUSCO on each contig, just on the whole assembly first; that way you save time and resources. Once you find the optimal set of options you should be good to go.

sivico26 commented 4 years ago

All right, I am going to do some polishing and evaluate whether it improves the BUSCO score. To answer some of your questions:

Thanks for the advice, I will return to you after polishing.

Regards

sivico26 commented 4 years ago

Hi again, Edwin.

I performed two polishing passes with POLCA using the short-read data I have. You can see in its paper that POLCA and NextPolish are faster and more accurate than Pilon, which has become the reference polisher.

Anyway, here are the results:

Field   Complete    Single  Duplicate       Fragmented      Missing Total
Raw     2046    1939    107     29      46      2121
Pass1   2049    1941    108     26      46      2121
Pass2   2050    1942    108     25      46      2121

As you can see, the polishing recovered some fragmented genes, which is great. However, I think that for our purposes the improvement is rather marginal and would not change our results much. Or what do you think?

Regards

esolares commented 4 years ago

Hi,

It looks like you have most BUSCOs accounted for, at 96.5%. So far I haven't run HapSolo on an assembly that could not be improved, so it is very likely that your assembly is one that cannot take advantage of HapSolo's method, since there aren't enough contigs that contain BUSCOs for HapSolo to train on. The only way to know for sure is to run BUSCO again on each individual contig, since there was a small increase in complete BUSCOs, but my guess is that the improvement is not enough. This could be because of fragmentation. I did think that perhaps it was due to high ploidy, but there should still be many contigs that contain BUSCOs. There are many things we could speculate about, such as the timing of genome duplication or the conservation of the euchromatic regions that contain BUSCOs. I also see that the number of duplicate BUSCOs is already very low; perhaps there is not much heterozygosity in the regions that contain BUSCOs. HapSolo depends on heterozygosity in regions that contain BUSCOs, so a genome with a low number of duplicate BUSCOs won't benefit much from HapSolo, because it tries to minimize the number of duplicate and missing BUSCOs.

In short, I agree with you: I also do not think your genome would improve much by running HapSolo. Have you tried purge_dups? It uses constant cutoffs for coverage and alignment similarity. My apologies that HapSolo couldn't be of more use to you. I should update the README so that it informs users that HapSolo works best when there is a high number of duplicates and/or a minimum percentage of contigs containing BUSCOs.

Thank you,

Edwin

esolares commented 3 years ago

I will be closing this ticket. Please let me know if your issue is still outstanding.

Thank you again,

Edwin

sivico26 commented 3 years ago

Hi Edwin,

I apologize for letting so much time pass. The genome shows high heterozygosity (according to genome profiling), but I also suspect that, as you said, not much of it lies within the BUSCO orthologs, which takes power away from HapSolo.

We did some preliminary annotation this past month and found that gene duplication is indeed quite high. It just does not seem to be well captured by BUSCO, even when running scaffold by scaffold.

I appreciate all your help with this case. I think HapSolo is a great tool, but as is common in bioinformatics, there is no one-size-fits-all solution, and this is just an edge case. I did try purge_dups and, although it got somewhat better results, they were still minimal (it removed ~3000 scaffolds, but totaling just ~15 Mb).

Again, thanks for everything!