Jwindler / AutoHiC

A novel genome assembly pipeline based on deep learning
MIT License

Issues with running AutoHiC on two datasets #5

Open ksenia-krasheninnikova opened 1 year ago

ksenia-krasheninnikova commented 1 year ago

Hello,

I've been trying to run AutoHiC on two datasets from scratch. In both cases the bwa mem, Juicer and 3d-dna steps seem to finish correctly, or at least they report no critical errors. At the later stages, however, both runs fail with different errors.

The first dataset is for an insect, Nudaria mundana. The error is

File /lustre/scratch123/tol/teams/tola/users/kk16/autohic_data/ilNudMud1/result/AutoHiC_ilNudMud1/autohic_results/3/ilNudMud1.final.hic cannot be opened for reading

Indeed, there is no such file. The folder contains:

$ ls ilNudMud1/result/AutoHiC_ilNudMud1/autohic_results/3/
black_list.txt                     ilNudMud1_lines.final.assembly  ilNudMud1_lines.final.hic
ilNudMud1_lines.cprops                 ilNudMud1_lines.FINAL.assembly  ilNudMud1_lines.mnd.txt
ilNudMud1_lines.final.asm              ilNudMud1_lines.final.cprops    png
ilNudMud1_lines.final_asm.scaffold_track.txt   ilNudMud1_lines.final.fasta     test.assembly
ilNudMud1_lines.final_asm.superscaf_track.txt  ilNudMud1_lines.FINAL.fasta

The second is a high-coverage dataset for a protist, Eimeria maxima, where the error is

│ AutoHiC/src/utils/get_chr_ │
│ data.py:127 in hic_loci2txt                                                  │
│                                                                              │
│   124 │   │   chr_len_list_sorted[chr_index + 1][0] = chr_len_list_sorted[ch │
│   125 │                                                                      │
│   126 │   #                                                                  │
│ ❱ 127 │   chr_len_list_sorted[0][0] = 0                                      │
│   128 │   if hic_len is not None:                                            │
│   129 │   │   chr_len_list_sorted[-1][1] = hic_len                           │
│   130 │   else:                                                              │
│                                                                              │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │            chr_dict = {}                                                 │ │
│ │        chr_len_list = []                                                 │ │
│ │ chr_len_list_sorted = []                                                 │ │
│ │             hic_len = None                                               │ │
│ │       redundant_len = 200000                                             │ │
│ │            txt_path = 'aut… │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
IndexError: list index out of range

The log files are attached.

autohic_ilNudMud121106.txt autohic_pxEimMaxi123763.txt

I wonder if it's possible to get some help with troubleshooting?

Many thanks!

Jwindler commented 1 year ago

Thanks for your feedback. The first bug is caused by the genome format and will be fixed in the next version. In the meantime, you can use seqkit to rewrap the genome FASTA to 80 bases per line and run AutoHiC again. Regarding the second error, could you please send me the file /lustre/scratch123/tol/teams/tola/users/kk16/autohic_data/pxEimMaxi1/result/AutoHiC_pxEimMaxi1/autohic_results/chromosome/chromosome.png? It will help me debug.
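
With seqkit, the rewrapping can be done with something like `seqkit seq -w 80 genome.fasta > genome.w80.fasta` (the file names are placeholders). For anyone without seqkit, here is a minimal Python sketch of the same idea. The function name `rewrap_fasta` is illustrative and not part of AutoHiC, and the sketch keeps one whole sequence in memory at a time.

```python
def rewrap_fasta(src, dst, width=80):
    """Rewrite a FASTA file so each sequence line holds at most `width` bases."""

    def flush(out, chunks):
        # Join the collected pieces of one record and write them in fixed-width lines.
        seq = "".join(chunks)
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")

    with open(src) as fin, open(dst, "w") as fout:
        chunks = []
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                flush(fout, chunks)  # finish the previous record, if any
                chunks = []
                fout.write(line + "\n")
            else:
                chunks.append(line)
        flush(fout, chunks)  # finish the last record
```

Calling `rewrap_fasta("genome.fasta", "genome.w80.fasta")` writes a copy with 80 bases per line and leaves the headers unchanged.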

ksenia-krasheninnikova commented 12 months ago

Thanks for the quick reply. The file is attached

[Attached image: chromosome]

Jwindler commented 12 months ago

Thanks for the file; it seems the problem has been identified. This image is the final global interaction heatmap of the chromosomes, and AutoHiC infers the number of chromosomes from it. In this case, however, the chromosome information cannot be determined from the image, which caused the AutoHiC runtime error. That error is currently being fixed. As you can see from the heatmap, though, the assembly results are very poor, so AutoHiC cannot correct the errors or split the chromosomes. I recommend checking the data for problems.
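
To illustrate the failure mode: in the traceback above, `chr_len_list_sorted` is empty when `chr_len_list_sorted[0][0] = 0` runs, hence the IndexError. Below is a hedged sketch of a guard that would report this case more clearly. `check_detected_chromosomes` is a hypothetical helper, not AutoHiC's actual fix, and it assumes each list entry is a [start, end] pair as the traceback suggests.

```python
def check_detected_chromosomes(chr_len_list_sorted, image_path):
    """Fail with a readable message when no chromosomes were detected.

    Hypothetical helper: assumes the list holds [start, end] pairs,
    as the indexing in the traceback suggests.
    """
    if not chr_len_list_sorted:
        raise ValueError(
            f"No chromosomes could be inferred from {image_path}; "
            "check the heatmap and the Hi-C data before rerunning."
        )
    return chr_len_list_sorted
```

Such a check would run before the first index access, so an empty detection result stops with an explanation instead of an IndexError.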

Jwindler commented 12 months ago

Could you please specify the size of the genome, the size of the Hi-C data, animal or plant, diploid or polyploid?

yumisims commented 12 months ago

It is a lepidopteran, and it may well be diploid.

Jwindler commented 12 months ago

This is very strange. Diploid Lepidoptera were also included when testing AutoHiC, and the assembly results were much better than the ones you provided. I suspect that the scaffolding results are too poor or that there is not enough Hi-C data. Hope this helps.

ksenia-krasheninnikova commented 12 months ago

It was a protist genome of 45 Mb. The Hi-C data is high coverage, but it is possible that it is problematic.

[Attached image: Screenshot 2023-09-19 at 14 07 36]

ksenia-krasheninnikova commented 11 months ago

I can confirm that with a line length of 80 bp in the FASTA files the pipeline works correctly. Thank you for your help.

I now have AutoHiC results for a couple of datasets and would like to access the FASTA file for the assemblies labeled 'Before adjustment' in the .html file. From the code, it seems this should be the 3d-dna assembly with the lowest estimated number of 'Translocation' + 'Inversion' errors. What happens when these sums are equal at different iterations but the 'Debris' counts differ? Also, what would be the best way to extract the FASTA file for it? Thank you.

UPD: I also wonder whether it's possible to tell which scaffold names correspond to the Hi-C maps in the 'Location' field under the 'Error adjustment' section.

gfHygPuni3.result.html.zip ilMicArun2.result.html.zip

Jwindler commented 11 months ago

Thank you for your feedback. Judging from your result reports, the ilMicArun2 result is relatively good, but the final number of chromosomes looks wrong and may need to be adjusted manually for your case. The gfHygPuni3 result does not look very good, and I don't know why.

Jwindler commented 11 months ago

  1. The genome sequences before adjustment are in the corresponding folder, named X.FINAL.fasta and X.rawchrom.fasta.
  2. AutoHiC determines the number and sum of errors for each adjustment, and the result with the lowest number of errors is selected. The Hi-C files and the genome results of each adjustment are located in the autohic_results directory.
  3. The best result is saved in the same directory as the report (X_autohic.fasta).
  4. The sequence names can be obtained from the JSON files in the autohic_results directory. You can refer to this link: https://github.com/Jwindler/AutoHiC/blob/main/example/detail_result.md

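
The selection described in item 2 can be pictured roughly as below. This is a sketch only: the counts are made up, and breaking ties by the lower 'Debris' count and then by the smaller iteration label is an assumption, since this thread does not say how AutoHiC resolves ties.

```python
def pick_best_iteration(error_counts):
    """Return the iteration label with the fewest translocation + inversion errors.

    error_counts maps an iteration label to a dict of error counts.
    The tie-break (fewer debris, then smaller label) is an assumption.
    """
    def rank(item):
        label, counts = item
        total = counts["translocation"] + counts["inversion"]
        return (total, counts["debris"], label)

    return min(error_counts.items(), key=rank)[0]


# Made-up counts for illustration only.
example = {
    "0": {"translocation": 1, "inversion": 2, "debris": 5},
    "1": {"translocation": 2, "inversion": 1, "debris": 3},
    "2": {"translocation": 3, "inversion": 2, "debris": 0},
}
best = pick_best_iteration(example)
```

Here iterations "0" and "1" tie on the sum, so the assumed tie-break decides between them.
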
ksenia-krasheninnikova commented 11 months ago

Thank you for the reply. The errors reported in ilMicArun2.result.html are only present in autohic_results/0/inversion_error.json and autohic_results/0/idebris_error.json. However, autohic_results/0 doesn't contain any FASTA files. Which FASTA file should I refer to in this case? Thanks.

Jwindler commented 11 months ago

The first results are in the hic_results/3d-dna directory; folders 0, 1 and 2 contain no FASTA files. AutoHiC adjusts the assembly and generates FASTA files based on the best result. If you want the FASTA files for 0, 1 or 2, you can take the x.assembly file from the hic_results/3d-dna directory and generate the FASTA file with the following command:

bash run-asm-pipeline-post-review.sh -r adjusted.assembly genome.fasta merged_nodups.txt

Please specify the absolute path of each file:
run-asm-pipeline-post-review.sh is in the 3d-dna folder
adjusted.assembly is the output from onehic.py
merged_nodups.txt is the output from Juicer
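
Since every argument has to be an absolute path, a small wrapper can build the command shown above. This is a minimal sketch: the function name `build_post_review_cmd` and the example paths are placeholders, and only the script name and the `-r` argument order are taken from the command in this comment.

```python
from pathlib import Path


def build_post_review_cmd(three_d_dna_dir, assembly, genome_fasta, merged_nodups):
    """Build the run-asm-pipeline-post-review.sh call with absolute paths."""
    script = Path(three_d_dna_dir).resolve() / "run-asm-pipeline-post-review.sh"
    return [
        "bash", str(script),
        "-r", str(Path(assembly).resolve()),
        str(Path(genome_fasta).resolve()),
        str(Path(merged_nodups).resolve()),
    ]


# Placeholder paths; replace them with the real locations before running.
cmd = build_post_review_cmd(
    "3d-dna", "adjusted.assembly", "genome.fasta", "merged_nodups.txt"
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.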