EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org
Other
305 stars 69 forks source link

Using `int8_t` instead of `char` in `FM_METADATA` #318

Closed Augustin-Zidek closed 7 months ago

Augustin-Zidek commented 8 months ago

hmmer.h defines the FM_METADATA struct:

typedef struct fm_metadata_s {
  uint8_t  fwd_only;
  uint8_t  alph_type;
  uint8_t  alph_size;
  uint8_t  charBits;
  uint32_t freq_SA; //frequency with which SA is sampled
  uint32_t freq_cnt_sb; //frequency with which full cumulative counts are captured
  uint32_t freq_cnt_b; //frequency with which intermittent counts are captured
  uint16_t block_count;
  uint32_t seq_count;
  uint64_t char_count; //total count of characters including those in and out of the alphabet
  char     *alph;
  char     *inv_alph;
  int      *compl_alph;
  FILE         *fp;
  FM_SEQDATA   *seq_data;
  FM_AMBIGLIST *ambig_list;
} FM_METADATA;

The alph and inv_alph are typed as char, which could cause issues when people build HMMER with -funsigned-char.

E.g. the check in makehmmerdb.c:659:

if (meta->inv_alph[c] == -1) {

will be always false when HMMER is built with unsigned char.

Would it make sense to change the type of alph and inv_alph to int8_t (or signed char) which has more strictly defined behaviour?

Thanks!

cryptogenomicon commented 8 months ago

Yes, I agree, thanks for the heads-up.

Changing those from char to int8_t is the right thing to do, but it's not an easy change; that change has other knock-on effects. Instead, for now, I changed the relevant flags from values of -1 to 127 (which is a safe change, if slightly hacky) and left a bread crumb trail of comments for future refactoring.

Augustin-Zidek commented 8 months ago

Thanks!

Looking at the fix: Will hmmerdbs produced with v3.4 be incompatible with the next version that includes this fix?

cryptogenomicon commented 8 months ago

I believe they should be compatible. That code is only affecting error handling in converting sequence char symbols to digital codes (e.g. encoding 'A' as 0, 'C' as 1, etc.; unused ASCII chars got assigned a digital code of -1 for "unset"). Any valid index of real DNA sequences won't have invalid residue symbols.

(In checking into this, I found one more place I had to make a change for the new flag of 127.)

Augustin-Zidek commented 7 months ago

Thanks a lot!