combogenomics / medusa

A draft genome scaffolder that uses multiple reference genomes in a graph-based approach.
http://combo.dbe.unifi.it/medusa/
GNU General Public License v3.0

Getting the distance estimation as an output #18

Open ShaiberAlon opened 7 years ago

ShaiberAlon commented 7 years ago

When using the -d option (distance estimation), is there a way to get the estimated distances as an output (for example, in the SUMMARY file)?

ShaiberAlon commented 7 years ago

My mistake! I see that there is an output '*distanceTable'. Sorry for that.

However, all the distances I got are either 1.0 or 0.0. Does that make sense? Also, is there a way to find out which contigs ended up in which scaffold? And, if I understand correctly, the file '*network.gexf' has the orientation for each edge. Is there a way to also find out, for each contig, which strand it belongs to (i.e. whether MeDuSa converted the contig's sequence to its reverse complement)?

EBosi commented 7 years ago

Hi

> But I see that all the distances I got are either 1.0 or 0.0 does that make sense?

Can you paste the command line you used here?

> Also, is there a way to find out which contigs ended up in which scaffold? And also, if I understand correctly the file '*network.gexf' has the orientation for each edge. Is there a way to also find, for each contig, to which strand it belongs (i.e. was the sequence of the contig converted to the reverse complement by MeDuSa)?

I have to do that... I could merge both pieces of information into a single file. Could you please tell me what you think is the best way to do it? I'm looking at http://www.ebi.ac.uk/ena for usable formats.

ShaiberAlon commented 7 years ago

Hi,

Thank you very much for your quick response! The command I used is:

java -jar medusa.jar -f 01_FASTA/ -i p214_sequence-FASTA-estimated-distance.fa -d -v -gexf -o p214_sequence-FASTA-medusa-fixed-estimated-distance.fa

If you wish, I could also provide you with the files I used (but note that I used 106 reference genomes).

As for the format for exporting the contig arrangement, I think a table with the full information on the contigs would be best. I would suggest a table where each row corresponds to one contig in the input FASTA file, with the following columns: contig_number, contig_name, scaffold_name, contig_orientation, contig_reverse_complemented

Where:
contig_number: a serial number from 1 to N (N being the number of contigs in the input), following the order of the contigs in the final output FASTA file.
contig_name: the name of the contig in the input FASTA file.
scaffold_name: the name of the scaffold, in the output FASTA file, that contains the contig.
contig_orientation: 1 or -1, according to whether the orientation of the contig was kept or changed.
contig_reverse_complemented: 1 or -1, according to whether the sequence nucleotides were kept or complemented.

In addition, if distance estimation is performed (-d), then I would suggest adding another column estimated_distance_to_next_contig (where the contigs at the end of a scaffold would just have a NaN or NA).
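To make the suggested layout concrete, here is a minimal Python sketch of how such a table could be written as tab-separated text. It is not MeDuSa code: the input structure (`scaffolds`), the function names, and the example contig names are all hypothetical, and only a subset of the suggested columns is shown for brevity.

```python
import csv
import io

# Subset of the columns suggested in this thread (hypothetical output format).
COLUMNS = [
    "contig_number",
    "contig_name",
    "scaffold_name",
    "contig_orientation",
    "estimated_distance_to_next_contig",
]


def layout_rows(scaffolds):
    """Build one row per contig.

    `scaffolds` is an assumed structure: scaffold name -> ordered list of
    (contig_name, orientation, distance_to_next) tuples, where orientation
    is 1 or -1 and distance_to_next may be None.
    """
    rows = []
    number = 0
    for scaffold_name, contigs in scaffolds.items():
        for index, (name, orientation, distance) in enumerate(contigs):
            number += 1
            is_last = index == len(contigs) - 1
            rows.append({
                "contig_number": number,
                "contig_name": name,
                "scaffold_name": scaffold_name,
                "contig_orientation": orientation,
                # The last contig of a scaffold has no successor, so use NA.
                "estimated_distance_to_next_contig":
                    "NA" if is_last or distance is None else distance,
            })
    return rows


def write_tsv(rows, handle):
    """Write the rows with a header line, tab-separated."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    example = {
        "Scaffold_1": [("contig_A", 1, 120.0), ("contig_B", -1, None)],
        "Scaffold_2": [("contig_C", 1, None)],
    }
    buffer = io.StringIO()
    write_tsv(layout_rows(example), buffer)
    print(buffer.getvalue(), end="")
```

A header line plus one row per input contig keeps the file easy to load with standard tools such as `csv` or a spreadsheet.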

I hope this is helpful!

oschwengers commented 7 years ago

Hello, in addition to what @ShaiberAlon mentioned, coordinates of contig x in scaffold y (start, end) would be very helpful.

So maybe you could extend your output with a separate, parsable (.tsv) file containing one line for each input contig:

contig_number, contig_name, scaffold_name, contig_orientation, contig_start, contig_end
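As an illustration of what contig_start and contig_end could mean, the following Python sketch derives 1-based, inclusive coordinates from contig lengths and the gap lengths between neighbouring contigs. The function name and the coordinate convention are assumptions made for this example; how MeDuSa actually pads gaps in the output FASTA is not stated in this thread.

```python
def contig_coordinates(contig_lengths, gap_lengths):
    """Return (start, end) for each contig of one scaffold.

    Coordinates are 1-based and inclusive (an assumed convention).
    `gap_lengths[i]` is the number of gap bases placed between contig i
    and contig i + 1, so it must have one entry fewer than the contigs.
    """
    if contig_lengths and len(gap_lengths) != len(contig_lengths) - 1:
        raise ValueError("expected exactly one gap between each contig pair")

    coordinates = []
    start = 1
    for index, length in enumerate(contig_lengths):
        end = start + length - 1
        coordinates.append((start, end))
        if index < len(gap_lengths):
            # The next contig begins right after the gap that follows this one.
            start = end + gap_lengths[index] + 1
    return coordinates


if __name__ == "__main__":
    # Three contigs separated by two gaps of 100 bases each (made-up numbers).
    print(contig_coordinates([1000, 500, 200], [100, 100]))
```

With these made-up lengths, the second contig starts at position 1101 because the first contig occupies 1 to 1000 and is followed by a 100-base gap.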

Thanks a lot for this excellent tool!