eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis
Other
110 stars 26 forks source link

Number of redundant clusters doesn't match #90

Closed carolynzy closed 2 years ago

carolynzy commented 2 years ago

Hi,

I have got another problem with make_nr_matrix.pl.
As far as I understand, all the clusters in the log file will be in the last line of the tab file. So I checked the numbers of the redundant clusters in these two files.

With the same dataset used in #89 : I have found 2528 unique clusters in the log file and 2321 unique clusters in the last line of the tab file. So some redundant clusters were not added to the nr matrix file. I didn't use the -e option this time. How to explain this difference?

Thank you very much for your help! Always so efficient and helpful!

eead-csic-compbio commented 2 years ago

Hi @carolynzy , thanks again for reporting this. I will have a look. Can you please send the exact commands you ran? Thanks

eead-csic-compbio commented 2 years ago

In my tests I did:

../make_nr_pangenome_matrix.pl -m intersection_COG_OMCL_renamed/pangenome_matrix_t0.tab -n 2 -C 50 -P
...
# input matrix contains 9179 clusters and 421 taxa

# filtering clusters ...
# 9179 clusters with taxa >= 1 and sequence length >= 0
...
# 7768 non-redundant clusters
# created: intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.faa
# see log: intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.log

I then counted the unique redundant clusters:

cut -d" " -f 1  intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.log | sort -u | wc -l
1411    1411    6898

Note this number is correct (1411 + 7768 = 9179). However, the transposed matrix had a missleading number of empty rows at the end. For this reason the suggested oneliner to transpose the matrix is now (see https://github.com/eead-csic-compbio/get_homologues/commit/b2aaf8ed7b2da85df87cca21b6b44fb8b8aa81ed):

perl -F'\t' -ane '$F[$#F]=~s/\n//g;$r++;for(1 .. @F){$m[$r][$_]=$F[$_-1]};$mx=@F;END{for(1 .. $mx){for $t(1 .. $r){print"$m[$t][$_]\t"}print"\n"}}' \
   intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.tab

Let me know if this is clear, Bruno

carolynzy commented 2 years ago

@eead-csic-compbio Yes, I could get the same result using your command. But this doesn't solve my problem.

My confusion is about the redundant clusters in the _pangenome_matrix_t0_nr_t1_l0_e0_C50S90.tab file.

If my understanding is correct, the last line in the tab file should contain all redundant clusters, is it correct?

If that is the case, however, I could only find 1117 redundant clusters (after adding the redundant to the clusters kept in the final result) in the tab file. The command I used to calculate this number is:

tail -n1 pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.tab | sed 's/\t/\n/g' | grep ',' | wc

This could be confirmed by this command:

head -n1 pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.tab|sed 's/\t/\n/g' | grep -P "+[0-9]{1}" | wc

And what's more confusing, the redundant clusters in the last line of the tab file don't match the clusters in the log file.

For example, the 3rd clusters in the tab file is "5_hypothetical_protein.faa+1" in the first line or "5_hypothetical_protein.faa,239533_hypothetical_protein.faa" in the last line. But I couldn't find 5_hypothetical_protein.faa in the log file, and 239533_hypothetical_protein.faa was redundant with another cluster:

less pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.log|grep -w 239533_hypothetical_protein.faa 4538 (1033016_hypothetical_protein.faa) is redundant with 2522 (239533_hypothetical_protein.faa) : 99.441 62

I'm quite confused about this.

Thank you for your time and help!

Yang

eead-csic-compbio commented 2 years ago

Thanks @carolynzy , it seems you have found a bug. I will look into this tomorrow, ok? See you

eead-csic-compbio commented 2 years ago

Hello again @carolynzy , I believe the commit https://github.com/eead-csic-compbio/get_homologues/commit/6f331fbe38d6079246bb0b5d0a6482801f0db616 fixes both issues:

1) The names of clusters in the log match those in the output matrix.

2) Now all redundant cluster names are added to the last column of the matrix. This required recursively working out all redundant nodes. The names in earlier versions were correct, but coulñd be incomplete in cases where A was redundant to B, and B redundant to C

Using your data I recreated the transposed matrix and obtained the expected number of redundant clusters:

perl -lane 'next if($F[$#F] eq "NA"); if($F[0] =~ /\+(\d+)/){ $red += $1} END{print $red}' intersection_COG_OMCL_renamed/pangenome_matrix_t0_nr_t1_l0_e0_C50_S90.tr.tab
1411

Please git pull and let me know how it goes, Bruno

carolynzy commented 2 years ago

Thank you! This update solved the problem. Now the number and clusters are both matched. Cheers!