Issue with example data format

Wangray123 commented 1 month ago

Hello,

I am trying to prepare data files for visualization with this software, but I don't understand what the score values in the 4th column of your example file (rice_MH63_repeat.bed) mean, or what the values in the 5th column of the gene annotation file (rice_MH63_nonTEgene.gff3) mean.

Could you please explain them to me? Also, what commands should I run to produce files of this type? Could you please share the code used to generate these two files?

I look forward to your reply. Thank you!

banzhou59 commented 1 month ago

The score in the fourth column of the rice_MH63_repeat.bed file represents the length of the TE annotated in the current bin. The fifth column of the rice_MH63_nonTEgene.gff3 file contains the end base position of gene annotations. You can learn more about the BED and GFF3 file formats through the following links:

https://github.com/jianshu93/gfftobed/blob/main/README.md
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

TE annotation files can be obtained using EDTA or RepeatMasker, and you can use bedtools for statistical analysis or convert formats using gfftobed (https://github.com/jianshu93/gfftobed). Gene annotations can be obtained through homology mapping or de novo annotation. You can also learn more about the input file formats for GenomeSyn through the following link:

https://github.com/banzhou59/GenomeSyn/blob/main/GenomeSyn-1.2.7/README
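To make the column layout concrete, here is a minimal Python sketch (not part of GenomeSyn) that reads one GFF3 feature line and pulls out the start (column 4) and end (column 5) positions. The helper name parse_gff3_line and the example line are made up for illustration; the nine tab-separated columns follow the GFF3 specification linked above.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments or malformed lines.

    Hypothetical helper for illustration only, not GenomeSyn's parser.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None
    seqid, source, ftype, start, end, score, strand, phase, attributes = fields
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),  # column 4: first base of the feature (1-based)
        "end": int(end),      # column 5: last base of the feature (inclusive)
        "strand": strand,
    }


# Made-up example line, not taken from rice_MH63_nonTEgene.gff3.
example = "Chr01\texample\tgene\t1000\t2500\t.\t+\t.\tID=gene0001"
record = parse_gff3_line(example)
print(record["end"] - record["start"] + 1)  # feature length in bases: 1501
```

Because GFF3 coordinates are 1-based and inclusive, the feature length is end - start + 1.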

Wangray123 commented 1 month ago

Thank you very much for your reply, and sorry, I'm a novice in bioinformatics analysis with little experience. I don't understand why the end base positions of gene annotations (is that the position of the last gene?) can show gene density, rather than the number or length of genes. Or is my understanding correct that the number of genes in each window is counted based on the end positions in the gene annotations, and that this is what displays gene density on the graph?

banzhou59 commented 1 month ago

Hello, since GFF3 is a commonly used file format for gene annotation, we chose GFF3 as the input format for gene annotations to facilitate visualization. In the program, the density is displayed by calculating the gene lengths within each bin. If a gene spans the boundary between two consecutive bins, the gene is split at the bin boundary, and each portion of the gene length is added to its respective bin for the calculation.

Thank you for your excellent suggestion. Ideally, the input for annotations such as genes and repeats should accept either GFF3 or BED format, and we will improve this in future updates. We are very grateful for your use of our software and your help in improving it!
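The bin-splitting rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not GenomeSyn's actual implementation: the function name length_per_bin, the fixed bin size, and the example coordinates are hypothetical, coordinates are taken as 1-based inclusive (as in GFF3), and overlapping genes are not merged, so shared bases would be counted more than once.

```python
from collections import defaultdict


def length_per_bin(features, bin_size):
    """Sum feature length per fixed-size bin, splitting features at bin boundaries.

    features: iterable of (start, end) pairs, 1-based inclusive.
    Returns {bin_index: bases}, where bin 0 covers positions 1..bin_size.
    Hypothetical helper; overlapping features are NOT merged.
    """
    totals = defaultdict(int)
    for start, end in features:
        pos = start
        while pos <= end:
            bin_index = (pos - 1) // bin_size
            bin_end = (bin_index + 1) * bin_size   # last position inside this bin
            piece_end = min(end, bin_end)          # cut the feature at the bin boundary
            totals[bin_index] += piece_end - pos + 1
            pos = piece_end + 1                    # continue in the next bin
    return dict(totals)


# Made-up gene coordinates; the second gene crosses the boundary at position 1000.
genes = [(101, 300), (901, 1100), (1500, 1599)]
print(length_per_bin(genes, 1000))  # {0: 300, 1: 200}
```

In this example the gene at 901..1100 contributes 100 bases to bin 0 and 100 bases to bin 1, which is the splitting behaviour described in the reply.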

banzhou59 / GenomeSyn

Issue with example data format #5