GMOD / jbrowse-components

Source code for JBrowse 2, a modern React-based genome browser
https://jbrowse.org/jb2
Apache License 2.0
208 stars 62 forks source link

Reference Sequence track doesn't match FASTA #4258

Closed dtdoering closed 8 months ago

dtdoering commented 8 months ago

TL;DR

I suspect that @gmod/faidx-js incorrectly calculates the LINEBASES (per the fai index spec) using the last line of a FASTA record (which is usually shorter than the rest of the preceding lines) rather than the actual/full line length (e.g. 80 is common).

Describe the bug

First observation

Description

In a JBrowse Desktop session, the Reference Sequence track does not correspond with the correct sequence (DNA or protein) I expect to see in the genome for a gene of interest, and the correct sequence doesn't appear to be anywhere nearby.

To Reproduce

  1. Get the same FASTA + GFF:
  2. Load the assembly into a session in JBrowse Desktop:
    • Open JBrowse Desktop to the start screen
    • Click on "Open Sequence File(s)
    • Assembly name "saltrop"
    • Assembly display name (empty)
    • Type: FastaAdapter (since we don't have an indexed one from NCBI)
    • Choose the FASTA downloaded from NCBI (GCF_000016425.1_ASM1642v1_genomic.fna)
    • Press "Submit"
    • Launch it as an LGV
    • Click "Open" (should open contig NC_009380.1)
    • Click "Open track selector"
    • Select "Reference sequence (saltrop)" -- the only option right now
  3. Load the annotations:
    • Click "File" > "Open track..."
    • Under "Main file" click "File" > "Choose File"
    • Choose the GFF downloaded from NCBI (genomic.gff)
    • Click "Next"
    • Leaving everything default, click "Add"
  4. Navigate to a region demonstrating the issue:
    • Navigate to a view centered on the start of the STROP_RS05130 gene: NC_009380.1:1,140,687..1,140,755

Observed behavior

Expected behavior

https://www.ncbi.nlm.nih.gov/projects/sviewer/?id=NC_009380.1&tracks=[key:sequence_track,name:Sequence,display_name:Sequence,id:STD649220238,category:Sequence,annots:Sequence,ShowLabel:false,ColorGaps:false,shown:true,order:1][key:six_frames_translation,name:Six-frame translations,display_name:Six-frame translations,id:STD1365208605,annots:Six-frame translation,ShowOption:All,OrfThreshold:20,HighlightCodons:true,AltStart:false,SubsetAltStart:false,shown:true,order:3][key:gene_model_track,name:Genes,display_name:Genes,id:STD3194982005,category:Genes,annots:Unnamed,Options:MergeAll,CDSProductFeats:false,NtRuler:true,AaRuler:true,HighlightMode:2,ShowLabel:true,shown:true,order:5][key:feature_track,name:Repeat region,display_name:Repeat region,id:STD3463812800,subkey:repeat_region,category:Features,subcategory:Repeat region,annots:Unnamed,shown:true,order:6]&assm_context=GCF_000016425.1&v=1140680:1140762&c=CC99FF&select=gi|145592566-001167f0-00116c4b-0103-7a61336e-ffea8d58;&slim=0

Second observation

Description

The FASTA index file JBrowse Desktop is using lists a LINEBASES (4th field) of 51, whereas I believe it should be 80.

To Reproduce

At the command line:

  1. Navigate to the JBrowse Desktop app's folder (/Users/DTDoering/Library/Application Support/@jbrowse/desktop/fai for me)

    cd /Users/DTDoering/Library/Application Support/@jbrowse/desktop/fai
  2. View the index file:

    cat ./GCF_000016425.1_ASM1642v1_genomic.fna1709667217214.fai.fai
    # NC_009380.1   5183331 60  51  52
  3. The 4th field is 51, whereas running samtools faidx on the FASTA ourselves yields a 4th field of 80:

# navigate to wherever you saved the FASTA from NCBI
cd /Users/DTDoering/Downloads/ncbi_dataset (2)/ncbi_dataset/data/GCF_000016425.1

# index it
samtools faidx GCF_000016425.1_ASM1642v1_genomic.fna

# look at the index output
cat GCF_000016425.1_ASM1642v1_genomic.fna.fai
# NC_009380.1   5183331 60  80  81

Third observation

(I think) The @gmod/faidx-js testing snapshots also list the LINEBASES of the last line of their corresponding FASTA records -- 21 for ctgA and 79 for ctgB, rather than 60 for ctgA and 100 for ctgB.

Reproduce

At command line:

wget https://github.com/GMOD/faidx-js/raw/main/test/volvox.fa

samtools faidx volvox.fa

cat volvox.fa.fai                                                                                                                                                                              ─╯
# ctgA  50001   6   60  61
# ctgB  6079    50847   100 101

Compare to the snapshot:

ctgA    50001   6   21  22
ctgB    6079    50847   79  80

Final thoughts

I believe the incorrect FASTA indexing is the cause of the Reference Sequence track not appearing correctly, and that it was not caught by the existing faidx-js tests because they are incorrect as well.

Of course, I could be misunderstanding the FAI specification, or perhaps it was updated recently or something, but I wanted to document the unexpected behavior in case I'm not mistaken!

I'll also file an issue in the GMOD/faidx-js repo and link it here for tracking.

Please let me know if you need anything else to reproduce!

Screenshots

My JBrowse Desktop showing the region: Screenshot 2024-03-05 at 2 40 25 PM

Version

JBrowse Desktop: Version 2.10.2 (2.10.2)

OS: MacOS Sonoma 14.3.1

Samtools: I have the latest samtools from HomeBrew:

samtools --version                                                                                                                                                                             ─╯

samtools 1.19.2
Using htslib 1.19.1
Copyright (C) 2024 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             clang
    CPPFLAGS:       
    CFLAGS:         -Wall -g -O2
    LDFLAGS:        
    HTSDIR:         
    LIBS:           
    CURSES_LIB:     -lncurses

HTSlib compilation details:
    Features:       build=configure libcurl=yes S3=yes GCS=yes libdeflate=no lzma=yes bzip2=yes plugins=no htscodecs=1.6.0
    CC:             clang
    CPPFLAGS:       
    CFLAGS:         -Wall -g -O2 -fvisibility=hidden
    LDFLAGS:        -fvisibility=hidden 

HTSlib URL scheme handlers present:
    built-in:    preload, data, file
    S3 Multipart Upload:     s3w, s3w+https, s3w+http
    Amazon S3:   s3+https, s3+http, s3
    Google Cloud Storage:    gs+http, gs+https, gs
    libcurl:     imaps, pop3, gophers, http, smb, gopher, ftps, imap, smtp, smtps, rtsp, ftp, telnet, mqtt, ldap, https, ldaps, smbs, tftp, pop3s, dict
    crypt4gh-needed:     crypt4gh
    mem:     mem
cmdcolin commented 8 months ago

the PR auto-closed this but I just released a new version that should include a fix here https://jbrowse.org/jb2/blog/2024/03/06/v2.10.3-release/

thanks again for the detailed bug report, I hope it didn't cause you too much trouble :) if you have any more problems let me know