calkan / sonic

Some Organism's Nucleotide Information Container

Empty gaps.bed files from some reference genomes (e.g. C. elegans) with no type U or N #12

Closed (moldach closed 4 years ago)

moldach commented 4 years ago

In step 3 of the README, for gap annotations, it says we should extract the files from UCSC's chromAgp.tar.gz into a folder, merge them, and then grep for component type U or N (which is in column 5).
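A minimal sketch of that step, assuming the extracted AGP files are tab-separated with the object name in column 1, 1-based start and end in columns 2 and 3, and the component type in column 5. The function name `agp_gaps_to_bed` and the conversion to a 0-based start (the usual BED convention) are assumptions for illustration and are not taken from the SONIC README.

```shell
# agp_gaps_to_bed FILE...
# Print a 3-column BED line for every AGP row whose component type
# (column 5) is N or U. Hypothetical helper, not part of SONIC.
agp_gaps_to_bed() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                      # skip AGP header comments
        $5 == "N" || $5 == "U" {
            print $1, $2 - 1, $3           # assumed: BED start is 0-based
        }
    ' "$@"
}
```

Run from inside the folder the archive was unpacked into, this could be called as `agp_gaps_to_bed */* > gaps.bed`. If no row has type N or U, the output file is simply empty.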

IMO the description for this file by UCSC is lacking; there is no description of what type U or N are:

chromAgp.tar.gz - Description of how the assembly was generated from fragments, unpacking to one file per chromosome.

From the mouse example I see there are four distinct component types in column 5:

cat */*|awk '{print $5}' | sort | uniq
F
N
O
W

I would like to build a SONIC file for C. elegans (WBcel235), but there are no type U or N entries to be found (I only see type F - what's this?), resulting in an empty .bed file.

What do these mean?
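For context, column 5 of an AGP file is the component type. To my understanding of the NCBI AGP specification, `N` is a gap of specified size and `U` is a gap of unknown size, while the other letters label actual sequence (for example `F` finished, `O` other sequence, `W` WGS contig). Under that reading, an assembly with only `F` rows has no annotated gaps, so an empty gaps file would be expected. The sketch below tallies the types with those labels; the function name `agp_type_summary` is hypothetical.

```shell
# agp_type_summary FILE...
# Count AGP rows per component type (column 5) and attach a short label.
# Labels follow my reading of the NCBI AGP specification.
agp_type_summary() {
    awk -F'\t' '
        BEGIN {
            label["A"] = "active finishing"
            label["D"] = "draft HTG"
            label["F"] = "finished HTG"
            label["G"] = "whole genome finishing"
            label["O"] = "other sequence"
            label["P"] = "pre-draft"
            label["W"] = "WGS contig"
            label["N"] = "gap of specified size"
            label["U"] = "gap of unknown size"
        }
        /^#/ { next }                      # skip AGP header comments
        { count[$5]++ }
        END {
            for (t in count) {
                desc = (t in label) ? label[t] : "unrecognized"
                printf "%s\t%d\t%s\n", t, count[t], desc
            }
        }
    ' "$@" | sort                          # awk array order is unspecified
}
```

Called as `agp_type_summary */*`, it prints one line per type found, which makes it easy to see at a glance whether any gap rows exist.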

Furthermore, in step 4, for segmental duplication annotations, there is a genomicSuperDups.txt.gz file for the mouse genome; however, I cannot find an equivalent for C. elegans. I will contact UCSC to find out if they have a comparable file for this build.

My question to you is: in the absence of such file(s), would there be any harm in building SONIC while passing these two empty files (gaps.bed & dups.bed)?

sonic --ref ref.fasta --reps reps.out --gaps gaps.bed --dups dups.bed --make-sonic cell.sonic --info "UCSC_WBcel235"

Number of chromosomes: 7
Adding gap intervals to SONIC.
Read 0 BED entries.
Writing entries for chromosome 6
Wrote 0 entries.
Adding segmental duplication intervals to SONIC.
Read 0 BED entries.
Writing entries for chromosome 6
Wrote 0 entries.
Adding 99878 repeats to SONIC.
Read 99857 BED entries.
Writing entries for chromosome 6
Wrote 99857 entries.
Adding GC profile SONIC.
SONIC file cell.sonic is ready.
Memory usage: 31.61 MB.

calkan commented 4 years ago

No, that won't be a problem. The gaps file is only used to filter out calls that span assembly gaps. That is certainly an issue in mammalian genomes, since those gaps are surrounded by repeats, which cause mapping ambiguity. The same goes for the dups file.
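In practice this means a genome without gap or segmental duplication annotations can be given empty BED files. A small sketch, assuming a POSIX shell; `ensure_bed` is a hypothetical helper that creates an empty file only when none exists, so real annotations are never overwritten.

```shell
# ensure_bed FILE
# Create FILE as an empty BED file if it does not exist yet.
# Hypothetical helper, not part of SONIC.
ensure_bed() {
    [ -e "$1" ] || : > "$1"
}
```

After `ensure_bed gaps.bed` and `ensure_bed dups.bed`, the `sonic ... --make-sonic` command from the original report can be run unchanged.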

calkan commented 4 years ago

Added this explanation to the README. Closing.