SEA-PHAGES / starterator

Released stable version of Starterator for SEA phages. Note: it does not work with the current version of the phamerator database! For a version compatible with the current phamerator database, see this repo: cdshaffer/starterator

tracks with no starts little pink #14

Closed cdshaffer closed 6 years ago

cdshaffer commented 8 years ago

There is an issue with some Phams creating tracks with very little pink and NO starts annotated. An example can be seen with Bigfoot ORF 54 which is in Pham 20272.

Examination of the alignment shows a long stretch at the beginning covered by only a single gene, TM4_67. This could be the problem. Also consider: is there any reason to show any grey at the beginning of the track (i.e. the region with only DNA sequence)? Should this region be trimmed so the track always starts with a pink alignment area?

cdshaffer commented 8 years ago

I have run a version of the code that simply ignores TM4 and does not add it to the pham. The resulting figure looks much better. So again, the problem is probably the very long "upstream" unaligned region of TM4. This is going to be a tricky problem to solve if I need to change input values for clustal. The first thing to try is adjusting the starting point for the track drawing. I tried this by hard-coding the gd_diagram.draw() start position to 210 for pham 20272 and got good results. The 210 was the difference between the longest and the second-longest "ahead_of_start". Not sure what criterion would work in general, but it is nice to know I can probably fix this just by choosing an appropriate position to start drawing.
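A minimal sketch of that criterion, assuming the "ahead_of_start" length of every gene in the pham is already available as a plain list of integers. The function name and the sample numbers are made up for illustration and are not taken from the starterator code:

```python
def draw_start_offset(ahead_of_start_lengths):
    """Return the column at which drawing could start.

    Assumption: the offset is the difference between the longest and the
    second-longest "ahead_of_start" values, as in the hard-coded test.
    """
    if len(ahead_of_start_lengths) < 2:
        # With fewer than two genes there is nothing to trim against.
        return 0
    longest, second_longest = sorted(ahead_of_start_lengths, reverse=True)[:2]
    return longest - second_longest


# Hypothetical values: one gene with a very long upstream region.
offset = draw_start_offset([310, 100, 95, 80])
print(offset)
```

The resulting value would then be passed as the start position when drawing, in place of the hard-coded number.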

Looks like the info I need to assess is in graph_start_sites() in making_files. The info for where the pink parts are is in genes[i][i].alignment.features, which holds features of alternating type gap and seq. I could find the right end of the second-shortest gap and draw the figure from there.
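A rough sketch of the gap-based idea. It assumes each gene's alignment features can be reduced to a list of (kind, start, end) tuples ordered by position, with kind being "gap" or "seq". That tuple layout and both function names are stand-ins for illustration, not the real feature objects:

```python
def leading_gap_end(features):
    """Right end of the gap at the start of a track, or 0 if it starts with seq."""
    if features and features[0][0] == "gap":
        return features[0][2]
    return 0


def draw_start_from_gaps(all_features):
    """Pick the right end of the second-shortest leading gap.

    The gene with the shortest leading gap is the one with the long
    unaligned upstream region, so it is skipped.
    """
    ends = sorted(leading_gap_end(features) for features in all_features)
    if len(ends) < 2:
        return 0
    return ends[1]


# Hypothetical pham: the first gene has a long upstream region (no leading gap).
tracks = [
    [("seq", 0, 900)],
    [("gap", 0, 210), ("seq", 210, 900)],
    [("gap", 0, 215), ("seq", 215, 600), ("gap", 600, 900)],
]
print(draw_start_from_gaps(tracks))
```

If two genes share the same long upstream region, the second-shortest gap is also 0 and nothing is trimmed, so this criterion only handles the single-outlier case.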

cdshaffer commented 8 years ago

Found a few other likely examples: Pham 19869, Pham 19286, Pham 18837, Pham 15729.

cdshaffer commented 6 years ago

Update as of database version 198: see example phams 45565 and 45479.

cdshaffer commented 6 years ago

Another idea to explore is switching to a zoomed-in mode when there is a large number of different starts all crowded close together. The idea would be to draw just a zoomed-in region. For database version 198, good test phams are 57, 349, 429.

First attempt is to set the default zoom to show at most 30 bases upstream of start 1 and 30 bases downstream of the last start. This improves 57 and 349 above and is likely to not be too different for users.
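The default zoom could be computed along these lines. This is only a sketch: the function name, the argument names, and the use of plain alignment coordinates are assumptions rather than the actual starterator code.

```python
def default_zoom_window(start_positions, track_length, padding=30):
    """Return (left, right) bounds for drawing the track.

    Shows at most `padding` bases upstream of the first start and
    `padding` bases downstream of the last start, clamped to the track.
    """
    if not start_positions:
        # No starts found: fall back to drawing the whole track.
        return 0, track_length
    left = max(0, min(start_positions) - padding)
    right = min(track_length, max(start_positions) + padding)
    return left, right


# Hypothetical start coordinates on a 1000-column alignment.
print(default_zoom_window([120, 150, 400], 1000))
```

Clamping to the track bounds is what makes this "at most" 30 bases: when the first start is closer than 30 bases to the left edge, the window simply begins at 0.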

The issue remains when there are a large number of starts all crowded together in a tiny portion of the track. I could try to calculate some kind of "start density" and, if it is too large, either switch to or add a zoomed-in mode. Examples of this kind of issue are the larger phams: 7612, 6622, 429, 5705.

Based on the pham report for 5705, it would be good to have no more than 100 starts mapped across the whole track. If there are more than that, the idea would be to switch to a zoomed mode.
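One way this check might look, as a sketch only. The threshold of 100 comes from the note above. The density definition (distinct starts per 100 columns of the span they occupy) and the function names are assumptions, since no formula has been settled on:

```python
MAX_STARTS_PER_TRACK = 100  # from the pham 5705 report


def start_density(start_positions):
    """Distinct starts per 100 columns of the span they occupy (assumed definition)."""
    distinct = set(start_positions)
    if not distinct:
        return 0.0
    span = max(distinct) - min(distinct) + 1
    return 100.0 * len(distinct) / span


def needs_zoomed_mode(start_positions, max_starts=MAX_STARTS_PER_TRACK):
    """True when more distinct starts are mapped than the track can show legibly."""
    return len(set(start_positions)) > max_starts


# Hypothetical track with 150 candidate starts, one every 3 columns.
crowded = list(range(0, 450, 3))
print(needs_zoomed_mode(crowded), start_density(crowded))
```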

cdshaffer commented 6 years ago

Continued work; I now have a hopefully viable solution. The code starts by considering all starts from one upstream of the first annotation to one past the last annotation; let's call that the "region of interest". For this region it then counts the number of starts on each track, looking for the track with the largest number of starts in the region of interest. It then uses that number to calculate how large a region it needs in order to show that many starts. This is set by the scale parameter, which is basically the average percentage of the track given to each start. This value is currently set to 1.1, so the number of bases to display in the track is chosen so that there is an average of 1 start in every 1.1% of the track for the region of interest.
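A sketch of one possible reading of this calculation, not the code as committed. It assumes starts are given as alignment coordinates, that the region of interest should occupy (busiest track's start count × scale) percent of the drawn track, and that this share is capped at 100%. All names are hypothetical:

```python
import bisect


def region_of_interest(all_starts, annotated_starts):
    """From one start upstream of the first annotation to one past the last."""
    ordered = sorted(set(all_starts))
    first = bisect.bisect_left(ordered, min(annotated_starts))
    last = bisect.bisect_right(ordered, max(annotated_starts)) - 1
    left = ordered[max(first - 1, 0)]
    right = ordered[min(last + 1, len(ordered) - 1)]
    return left, right


def bases_to_display(track_starts, roi, scale=1.1):
    """Number of bases to draw so each start gets about `scale` % of the track."""
    left, right = roi
    roi_length = right - left + 1
    # Largest number of starts any single track has inside the region.
    busiest = max(sum(left <= s <= right for s in starts) for starts in track_starts)
    if busiest == 0:
        return roi_length
    roi_percent = min(100.0, busiest * scale)  # share of the track given to the region
    return int(round(roi_length * 100.0 / roi_percent))


# Hypothetical pham with three tracks.
tracks = [[10, 40, 55], [40, 55, 70, 90], [130]]
all_starts = [s for starts in tracks for s in starts]
roi = region_of_interest(all_starts, annotated_starts=[40, 70])
print(roi, bases_to_display(tracks, roi))
```

Under this reading, more starts in the region means the region takes a larger share of the track, so fewer bases are drawn overall and the figure is effectively zoomed in.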

This improves things nicely but still has an issue: it assumes an even distribution of starts. When there are a large number of starts in the region of interest and they all cluster together, they are still too close for legibility. On average, though, the new code is an improvement.