PacificBiosciences / apps-scripts

Miscellaneous scripts for applications of PacBio systems

Multigene families: trimLowercaseContigs.py #4

Open a-velt opened 6 years ago

a-velt commented 6 years ago

Dear Sarah,

I apologize in advance for bothering you so much right now, but I am in the final stages of my phased diploid genome assembly.

I used your script to remove contigs in which more than 50% of the bases are unpolished. I ran the script on primary contigs and haplotigs separately (29 contigs / 1825 removed, 83 haplotigs / 2901 removed). Then I took the contigs and haplotigs that the script removed and ran blastx on them to see what they correspond to.
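To make the filtering step concrete, here is a minimal sketch of the kind of filter described above. It is not the actual code of trimLowercaseContigs.py. It assumes, going by the script's name, that unpolished bases are written in lowercase in the FASTA file, and the function names (`lowercase_fraction`, `parse_fasta`, `split_by_polish`) and the `max_lowercase` parameter are hypothetical.

```python
def lowercase_fraction(seq):
    """Return the fraction of bases in seq that are lowercase.

    Assumption: lowercase bases are the unpolished ones.
    """
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_polish(records, max_lowercase=0.5):
    """Split records into (kept, removed).

    A record is removed when MORE than max_lowercase of its bases
    are lowercase (hypothetical threshold matching the 50% above).
    """
    kept, removed = [], []
    for header, seq in records:
        if lowercase_fraction(seq) > max_lowercase:
            removed.append((header, seq))
        else:
            kept.append((header, seq))
    return kept, removed
```

With an open file handle, `split_by_polish(parse_fasta(handle))` would return the two lists, and the removed list is what one would then pass to blastx.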

For the primary contigs, these are mostly repeat regions, functions that do not interest me, or viruses/bacteria. For the haplotigs, however, I was quite surprised by the results: a significant portion of the removed haplotigs correspond to multigene families, and of course I don't want to delete them.

Is this because the reads multimapped, so that during the polishing phase some haplotigs were not properly covered and FALCON did not polish them?

Do you think I should manually pick out the haplotigs that correspond to multigene families and put them back in the haplotig file I keep?
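If manual rescue is the way to go, it could look roughly like the sketch below. This is only an illustration: it assumes the kept and removed haplotigs are available as lists of (header, sequence) pairs and that the IDs to rescue were collected by hand from the blastx hits. The names `record_id` and `rescue_records` are hypothetical.

```python
def record_id(header):
    """Return the first whitespace-delimited token of a FASTA header."""
    parts = header.split(None, 1)
    return parts[0] if parts else ""


def rescue_records(kept, removed, rescue_ids):
    """Return kept plus every removed record whose ID is in rescue_ids.

    rescue_ids is a hand-curated collection of haplotig IDs, for
    example those with blastx hits to multigene families.
    """
    wanted = set(rescue_ids)
    rescued = [(h, s) for h, s in removed if record_id(h) in wanted]
    return kept + rescued
```

Records that are not listed in `rescue_ids` stay out of the final haplotig set, so the 50% filter still applies to everything else.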

Thank you very much for your help. Best, Amandine