ERROR:root:Error reading GenomeScope summary file: 'NoneType' object has no attribute 'group'

mahesh-panchal commented 2 months ago

Hi, I'm trying to get something up to make EARs and I'm having trouble testing the script.

So far the (test) input has:

# This is the yaml (v24.08.26) file for generating the ERGA Assembly Report (EAR) using the make_EAR.py (v24.08.26) script
# Please complete the required information pointed as <Insert ...>
# The file example/mEleMax1_example.yaml contains an example of a completed yaml file

# SAMPLE INFORMATION
ToLID: 'iyAncNigr1'
Species: 'Ancistrocerus nigricornis'
Sex: 'XX'  # for example: XX, XY, X0, ZZ, ZW, unknown, NA
Submitter: 'Mahesh Binzer-Panchal'
Affiliation: 'National Bioinformatics Infrastructure Sweden'
Tags: 'ERGA-BGE'  # valid tags are ERGA-BGE, ERGA-Pilot, ERGA-Community

# SEQUENCING DATA
DATA:  # add below name of available data and coverage
  - hifi: '34.7x' # if coverage is not available, leave it empty

# GENOME PROFILING DATA
PROFILING:
  GenomeScope:
    genomescope_summary_txt: '../../data/outputs/workflow-development/01_read_inspection/genescopefk/Ancistrocerus_nigricornis_summary.txt'
  Smudgeplot:  # Smudgeplot is not mandatory (but preferred for ploidy estimation), if not available, leave it empty
    smudgeplot_verbose_summary_txt: <Insert Smudgeplot results summary.txt file path>

# ASSEMBLY DATA
ASSEMBLIES:
  Pre-curation:
    collapsed:  # valid types are hap1, pri, collapsed

      gfastats--nstar-report_txt: '../../data/outputs/workflow-development/03_assembly/hifiasm-raw-default/hifiasm-raw-default.asm.bp.p_ctg.fasta.assembly_summary'
      busco_short_summary_txt: '../../data/outputs/workflow-development/03_assembly/busco/hifiasm-raw-default/short_summary.specific.hymenoptera_odb10.hifiasm-raw-default.asm.bp.p_ctg.fasta.txt'
      merqury_folder: '../../data/outputs/workflow-development/03_assembly/merqury'

  Curated:
    <Insert haplotype>:  # valid types are hap1, pri, collapsed
      gfastats--nstar-report_txt: <Insert gfastats--nstar-report.txt full path>
      busco_short_summary_txt: <Insert busco_short_summary.txt full path>
      merqury_folder: <Insert Merqury results folder path>
      hic_FullMap_png: <Insert pretext FullMap.png full path>  # also can be a HiC full contact map PNG from higlass
      hic_FullMap_link: <Insert .pretext file web link>  # also can be a web folder with .mcool from higlass
      blobplot_cont_png: <Insert blobplot contamination .png file full path>

# METHODS DATA
PIPELINES:  # add below name of the tools used for the assembly and curation steps, with versions and key parameters selected
  Assembly:
    <Insert ToolA>: <Insert ToolA version>/<Insert ToolA parameter>/<Insert ToolA parameter>  # First field correspond to version. Use / after each field to enter the parameters used
    <Insert ToolB>: <Insert ToolB version>

  Curation:
    <Insert ToolX>: <Insert ToolX version>  # First field correspond to version. Use / after each field to enter the parameters used
    <Insert ToolY>: <Insert ToolY version>/<Insert ToolY parameter>

# CURATION NOTES
NOTES:
  Obs_Haploid_num: <Insert observed haploid number> # integer
  Obs_Sex: <Insert observed sex>  # for example: XX, XY, X0, ZZ, ZW, unknown, NA
  Interventions_per_Gb: <Insert manual intervernation during curation>  # integer or empty
  Contamination_notes: <Insert contamination notes>  # text in quotes "", related to the decontamination process, or presence of plastids or symbionts
  Other_notes: <Insert other notes>  # text in quotes "", related to sample characteristics and quality, the curation process, etc

That is, it's not fully filled in, but I was hoping to see how far I could get.

I'm running it in the supplied environment using my own script:

#! /usr/bin/env bash
set -ueo pipefail

function generate_ear {
        INPUT="${INPUT:-$1}"

        set +u
        eval "$(conda shell.bash hook)"
        conda activate ./conda-erga-ear
        set -u

        python erga-assembly-reports/make_EAR.py "$INPUT"
}

generate_ear ear_input.yml

with v24.08.26 of the python script.

The error I'm getting in the EAR.log is:

ERROR:root:Error reading GenomeScope summary file: 'NoneType' object has no attribute 'group'

and I have no idea what that means. Can I get some guidance on this please?

mahesh-panchal commented 2 months ago

Also the file does exist at that location:

less ../../data/outputs/workflow-development/01_read_inspection/genescopefk/Ancistrocerus_nigricornis_summary.txt
GenomeScope version 2.0
input file = Ancistrocerus_nigricornis_histex.hist
output directory = .
p = 2
k = 31
name prefix = Ancistrocerus_nigricornis

property                      min               max               
Homozygous (aa)               99.6338%          99.6487%          
Heterozygous (ab)             0.351313%         0.366188%         
Genome Haploid Length         NA bp             232,510,441 bp    
Genome Repeat Length          59,822,445 bp     59,893,829 bp     
Genome Unique Length          172,549,437 bp    172,755,335 bp    
Model Fit                     85.9618%          96.9602%          
Read Error Rate               0.181077%         0.181077%

diegomics commented 2 months ago

Hmm, I think the NA may be messing things up. Let me try to replicate the issue.

diegomics commented 2 months ago

Yep, that's it. We use the min Genome Haploid Length, so an NA value there throws that error. I can improve the code to use the max if the min is NA, but I have never encountered NA in the min before. Is this OK? Have you seen this, @tbrown91?
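
For illustration, the fallback could look roughly like the sketch below. This is not the actual make_EAR.py code: the helper name `haploid_length` and the regex are assumptions made up for this example. It also shows where the error message comes from: `re.search` returns `None` when the pattern does not match, and calling `.group()` on that `None` raises the `'NoneType' object has no attribute 'group'` error.

```python
import re

# Hypothetical helper: the name, regex, and return type are assumptions,
# not the actual make_EAR.py implementation.
HAPLOID_RE = re.compile(
    r"Genome Haploid Length\s+(NA|[\d,]+)(?: bp)?\s+(NA|[\d,]+)(?: bp)?"
)

def haploid_length(summary_text):
    """Return the min Genome Haploid Length in bp, or the max if the min is NA."""
    match = HAPLOID_RE.search(summary_text)
    if match is None:
        # re.search returns None when nothing matches; calling .group() on
        # that None is what raises the AttributeError seen in EAR.log.
        return None
    for value in match.groups():  # (min, max), in that order
        if value != "NA":
            return int(value.replace(",", ""))
    return None  # both columns are NA

line = "Genome Haploid Length         NA bp             232,510,441 bp    "
print(haploid_length(line))  # 232510441
```

With the summary posted above, the min column is NA, so the sketch returns the max value instead of failing.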

mahesh-panchal commented 2 months ago

I should add this is from GeneScopeFK and not GenomeScope2 despite what the summary says. I thought the outputs were identical, but perhaps not.

diegomics commented 2 months ago

Yes, I could get that from the folder name. Maybe this is a novelty of the FK-fork. Ok, I will make the update in the code.

diegomics commented 2 months ago

This should be fixed now. Reopen if not.

mahesh-panchal commented 2 months ago

Thanks. That looked like it worked.

ERGA-consortium / EARs, issue #67