AndersenLab / alignment-nf

A nextflow pipeline for genome sequences alignment
MIT License

Integrate PD1074 #10

Closed danielecook closed 4 years ago

danielecook commented 4 years ago

In order to realign all strains, we’ll need to have a close look at the new genome.

I have started looking at the new reference, PRJEB28388 / VC2010. Here are some resources to start:


The reference

I downloaded the reference and the chromosome names look like this:

>chrV_pilon 1 21243235
>chrX_pilon 1 18110855
>chrIV_pilon 1 17759200
>chrII_pilon 1 15525148
>chrI_pilon 1 15331301
>chrIII_pilon 1 14108536
>chrM_pilon 1 13988

This is discussed in the minutes (linked above):

Gene Ids of type 'PRJEB28388_chrIII_pilon.g6753' are causing problems for the WS270 run of the gene descriptions pipeline ... Hinxton to make code change to remove the problem IDs from the production files.

It sounds like the reference will be updated soon.
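Until then, the headers could be normalized locally. This is a minimal sketch, not something the pipeline does today: `normalize_fasta_headers` is a hypothetical helper, and it assumes the target naming is the bare `I`, `II`, ... form (strip the `chr` prefix and the `_pilon` suffix). Note that `chrM_pilon` becomes `M`, which may still need mapping to whatever name the existing reference uses for the mitochondrial genome.

```shell
# Hypothetical helper: rewrite FASTA headers such as ">chrV_pilon 1 21243235"
# to ">V 1 21243235". Reads FASTA on stdin, writes FASTA on stdout.
normalize_fasta_headers() {
    # Only header lines (starting with ">") are touched; sequence lines pass through.
    sed -E '/^>/ s/^>chr([^[:space:]_]+)_pilon/>\1/'
}

# Example usage (file names are placeholders):
#   gzip -dc reference.fa.gz | normalize_fasta_headers > reference.renamed.fa
```

Keeping this as a stream filter means it can be dropped into a download script without writing intermediate files.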

The GFF File

The GFF has some issues as well. For instance, I can't find gene names in it by grepping:

# returns nothing 
gzcat c_elegans.PRJEB28388.WS274.annotations.gff3.gz | grep 'pot-2'

Additionally, the GFF contains far fewer annotations than the one for the current reference.
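One way to put a number on the difference is to tally features by type (GFF3 column 3) in each file. A minimal sketch, assuming gzip-compressed GFF3 input; `count_gff_features` is a hypothetical helper name, and the PRJNA13758 file name in the usage comment is an assumption about how the N2 annotation file is named.

```shell
# Hypothetical helper: count GFF3 features per type (column 3).
# $1: path to a gzip-compressed GFF3 file.
# Prints "<count><TAB><type>", most frequent type first.
count_gff_features() {
    gzip -dc "$1" \
        | awk -F '\t' '!/^#/ && NF >= 9 { n[$3]++ } END { for (t in n) print n[t] "\t" t }' \
        | sort -k1,1nr -k2,2
}

# Example usage (second file name is an assumption):
#   count_gff_features c_elegans.PRJEB28388.WS274.annotations.gff3.gz > new.counts.tsv
#   count_gff_features c_elegans.PRJNA13758.WS274.annotations.gff3.gz > old.counts.tsv
```

`gzip -dc` is used instead of `gzcat` so the same command works on both macOS and Linux.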

Other considerations

CeNDR has tracks for:

These are not easy to produce. Alternatively, we can try to lift them over to the new reference, but that will definitely take some time.

How to obtain the reference

When the new reference does get integrated, it should be obtained using a script.
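A rough sketch of what such a script might look like. Everything here is an assumption rather than an agreed approach: `reference_url` and `download_reference` are hypothetical helpers, and the URL layout is assumed from the WormBase FTP release structure and the naming of the GFF file above, so it should be checked against the actual release before use.

```shell
# Hypothetical helper: build the download URL for a genomic FASTA.
# $1: WormBase release (e.g. WS274), $2: BioProject (e.g. PRJEB28388).
# The URL layout is an assumption and should be verified.
reference_url() {
    printf 'ftp://ftp.wormbase.org/pub/wormbase/releases/%s/species/c_elegans/%s/c_elegans.%s.%s.genomic.fa.gz\n' \
        "$1" "$2" "$2" "$1"
}

# Hypothetical helper: download the compressed FASTA.
# $3: output directory (defaults to the current directory).
download_reference() {
    url=$(reference_url "$1" "$2")
    outdir=${3:-.}
    mkdir -p "$outdir"
    # --fail makes curl exit non-zero on a server error instead of saving an error page.
    curl --fail --location --output "$outdir/$(basename "$url")" "$url"
}

# Example usage:
#   download_reference WS274 PRJEB28388 genome/
```

Parameterizing the release and BioProject keeps the download reproducible and makes it straightforward to switch once an updated reference is published.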

danielecook commented 4 years ago

For the time being I believe this is being sidelined.