gamcil / clinker

Gene cluster comparison figure generator
MIT License

ValueError: Distance matrix 'X' must be symmetric #14

Closed kforcone closed 3 years ago

kforcone commented 3 years ago

Hi,

I'm running Clinker on a server using .gbk files. I keep receiving the same error "ValueError: Distance matrix 'X' must be symmetric" and no output is created. I tried running it against just two of my files as well as all of them in a for loop. I haven't tried running it off the server yet, but if that could be the issue I'll definitely try.

I've attached an output file from the job submission (24299131.txt), my script (clinker_test.txt), and one of the files that it was run on (DSM_27508_154_fasta.phages_combined.gbk.txt). The rest of my files have the same format.

Thanks in advance!

24299131.txt clinker_test.txt DSM_27508_154_fasta.phages_combined.gbk.txt

gamcil commented 3 years ago

The file you provided has the LOCUS line length issue (https://github.com/gamcil/clinker/issues/9). Essentially, the GenBank parser can't read locus IDs longer than 16 characters. If I manually change the ID for each record in the file (i.e. from "NZ_CP014796.1 Salipiger profundus strain JLT2016 chromosome, complete genome_fragment_1" to just "NZ_CP014796.1"), clinker can parse and align the file with no issue.

However, your error suggests the files get parsed in fine but don't align properly. Could you try renaming the LOCUS lines and trying clinker again, and report back?
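To find which records are affected before editing anything by hand, a small check along these lines may help. It is only a sketch: the function name `long_locus_names` is made up here, the 16-character limit is taken from the comment above, and it assumes the LOCUS name is everything between the `LOCUS` keyword and the sequence length (`<digits> bp`).

```python
import re

# Assumption: the name is whatever sits between "LOCUS" and "<digits> bp".
LOCUS_RE = re.compile(r"^LOCUS\s+(?P<name>.+?)\s+\d+\s+bp\b")


def long_locus_names(genbank_text, max_len=16):
    """Return every LOCUS name in the text that is longer than max_len."""
    flagged = []
    for line in genbank_text.splitlines():
        match = LOCUS_RE.match(line)
        if match and len(match.group("name")) > max_len:
            flagged.append(match.group("name"))
    return flagged
```

Calling it on the contents of a .gbk file returns the names that would need shortening; an empty list means no LOCUS name exceeds the limit under the assumption above.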

gamcil commented 3 years ago

Okay, from further testing, this error occurs when clinker tries to create the cluster distance matrix but the cluster alignments themselves are empty. Could you try running clinker with the argument -i 0.0 and see if it can generate the plot? This drops the sequence identity % threshold to 0 so every gene-gene link is saved.
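As a rough illustration of why lowering the threshold matters, the sketch below filters gene-gene links by identity. It is not clinker's actual code: the `Link` type, the `keep_links` function, the example values, and the choice of `>=` for the comparison are all assumptions made for the example.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """Hypothetical gene-gene link with identity as a fraction in [0, 1]."""
    query: str
    target: str
    identity: float


def keep_links(links, threshold):
    """Keep only the links whose identity meets the threshold."""
    return [link for link in links if link.identity >= threshold]


links = [Link("geneA", "geneX", 0.12), Link("geneB", "geneY", 0.25)]
strict = keep_links(links, 0.3)   # both links fall below the cutoff
relaxed = keep_links(links, 0.0)  # a threshold of 0 keeps every link
```

With a non-zero cutoff and only weak hits, nothing survives the filter, which matches the "empty cluster alignments" situation described above; with a cutoff of 0 every link is retained.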

kforcone commented 3 years ago

I'm going to run it with -i 0.0, but I'm not sure what caused that issue, as it did not show up when I ran it just now.

I changed the LOCUS line of each sequence in the .gbk files, ran a test, and got it to work and create a graph :) The date on the LOCUS line needed to be changed as well; otherwise I got another error saying that the format of the LOCUS line wasn't recognized. I'm uploading the job output and a reformatted .gbk file as well. I have about 80 files in total, so I intend to write a script that reformats the LOCUS line in a loop, and I will post it when it's done in case anyone has the same issue.

Output file of successful run: 24360882.txt

Reformatted gbk file: DSM_22007_104_fasta.phages_combined.gbk.txt
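For anyone who needs the same fix before a script is posted, the name-shortening part of the LOCUS rewrite could look roughly like this. It is a sketch, not the script mentioned above: the function names are invented, it assumes the accession is the first whitespace-delimited token after `LOCUS` and that the sequence length appears as `<digits> bp`, and it does not touch the date, since the thread does not say which date format was required.

```python
import re

# Groups: "LOCUS" + spacing, first token of the name, extra name text,
# and the remainder of the line starting at the "<digits> bp" length field.
LOCUS_RE = re.compile(r"^(LOCUS\s+)(\S+)(.*?)(\s+\d+\s+bp\b.*)$")


def shorten_locus_line(line):
    """Reduce the LOCUS name to its first token; leave other lines as-is."""
    match = LOCUS_RE.match(line)
    if match is None:
        return line
    prefix, first_token, _extra, rest = match.groups()
    return f"{prefix}{first_token}{rest}"


def shorten_locus_lines(genbank_text):
    """Apply shorten_locus_line to every line of a GenBank file's text."""
    return "\n".join(shorten_locus_line(line) for line in genbank_text.split("\n"))
```

To cover a whole directory, one option is to loop over `pathlib.Path(".").glob("*.gbk")`, read each file with `read_text()`, and write the result of `shorten_locus_lines` to a new file so the originals stay untouched.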

gamcil commented 3 years ago

Oh ok, glad you got it resolved. Unfortunately, I can't do anything about the LOCUS line because of the BioPython parser, so I'll close this issue now.