Integrate PD1074

Before we can realign all strains, we need to take a close look at the new genome.

I have started looking at the new reference, PRJEB28388 / VC2010. Here are some resources to start with:

Wormbase directory - ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJEB28388
FASTA - ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJEB28388/sequence/genomic/
GFF - ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJEB28388/sequence/genomic/
Recompleting the Caenorhabditis elegans genome - Paper
Wormbase Minutes discussing new reference

The reference

I downloaded the reference and the chromosome names look like this:

>chrV_pilon 1 21243235
>chrX_pilon 1 18110855
>chrIV_pilon 1 17759200
>chrII_pilon 1 15525148
>chrI_pilon 1 15331301
>chrIII_pilon 1 14108536
>chrM_pilon 1 13988
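
If downstream tools expect plain chromosome names (I, II, ...), the `chr` prefix and `_pilon` suffix could be stripped from the headers. This is a minimal sketch that assumes this is the naming we want. `strip_pilon_names` is a hypothetical helper, and chrM is simply left as `M` rather than mapped to any other mitochondrial name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite FASTA headers such as ">chrV_pilon 1 21243235"
# to ">V 1 21243235". Only lines starting with ">" are touched, so sequence
# lines pass through unchanged.
strip_pilon_names() {
    sed -E '/^>/ s/^>chr([^_[:space:]]+)_pilon/>\1/'
}

# Usage (file names are placeholders):
#   gzip -dc reference.fa.gz | strip_pilon_names | gzip > reference.renamed.fa.gz
```

The same renaming would have to be applied to the sequence IDs in the GFF so that the two files stay consistent.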

This is discussed in the minutes (linked above):

Gene Ids of type 'PRJEB28388_chrIII_pilon.g6753' are causing problems for the WS270 run of the gene descriptions pipeline ... Hinxton to make code change to remove the problem IDs from the production files.

It sounds like the reference will be updated soon.

The GFF File

The GFF also has some issues. For instance, I can't find gene names in it by grepping:

# returns nothing (gzcat is the macOS name; use zcat or gzip -dc on Linux)
gzcat c_elegans.PRJEB28388.WS274.annotations.gff3.gz | grep 'pot-2'

Additionally, the GFF contains far fewer annotations than the one for the current reference.
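
To put a number on the difference in annotations, the features in each GFF can be tallied by type. This is a minimal sketch; `gff_feature_counts` is a hypothetical helper that reads uncompressed GFF3 on stdin and counts the values in column 3:

```shell
#!/usr/bin/env bash
# Hypothetical helper: count GFF3 features by type (column 3), most frequent first.
# Comment and directive lines (starting with "#") are skipped, as are lines
# that do not have the nine tab-separated GFF3 columns.
gff_feature_counts() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ } END { for (t in n) printf "%d\t%s\n", n[t], t }' \
        | sort -k1,1nr -k2,2
}

# Usage:
#   gzip -dc c_elegans.PRJEB28388.WS274.annotations.gff3.gz | gff_feature_counts
```

Running the same command on the GFF for the current reference gives a per-type comparison of what is missing.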

Other considerations

CeNDR has tracks for:

phyloP
phastCons

These are not easy to regenerate. Alternatively, we could try to lift them over to the new reference, but that will take some time.

How to obtain the reference

When the new reference does get integrated, it should be obtained using a script.
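
A rough sketch of what such a script could look like. The base directory comes from the resources listed at the top. The FASTA file name is an assumption, made by analogy with the GFF file name, and should be checked against the FTP listing. `reference_fasta_url` and `fetch_reference` are hypothetical helpers:

```shell
#!/usr/bin/env bash
# Base directory taken from the WormBase resources listed in this issue.
WORMBASE_BASE="ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans"

# Hypothetical helper: build the URL of the genomic FASTA for a project and release.
# ASSUMPTION: the file is named c_elegans.<project>.<release>.genomic.fa.gz,
# by analogy with the GFF file name; verify against the FTP listing first.
reference_fasta_url() {
    project="$1"
    release="$2"
    printf '%s/%s/sequence/genomic/c_elegans.%s.%s.genomic.fa.gz\n' \
        "$WORMBASE_BASE" "$project" "$project" "$release"
}

# Hypothetical helper: download the FASTA into a target directory
# (defaults to the current directory).
fetch_reference() {
    project="$1"
    release="$2"
    outdir="${3:-.}"
    url=$(reference_fasta_url "$project" "$release")
    mkdir -p "$outdir" || return 1
    curl --fail --silent --show-error --location \
        --output "$outdir/$(basename "$url")" "$url"
}

# Example (not run here):
#   fetch_reference PRJEB28388 WS274 genome/
```

Pinning the release (here WS274) in the call keeps the download reproducible if the reference is updated later.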

AndersenLab / alignment-nf

Integrate PD1074 #10