dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0
70 stars 39 forks source link

trim_loci not working for 3' R1 and 5' R2 #410

Open isaacovercast opened 4 years ago

isaacovercast commented 4 years ago

Somebody in gitter reported trim_loci wasn't working for the distal end of R1 and i checked and it is true that it's not doing anything. In debugging this I see this code in write_outfiles:

    def trim_param_trim_loci(self):
        "user entered hard trims"
        self.trims[0] = max([self.trims[0], self.data.params.trim_loci[0]])
        self.trims[1] = (self.trims[1] - self.data.params.trim_loci[5]
            if self.trims[1] else 0)
        self.trims[2] = (self.trims[2] + self.data.params.trim_loci[2]
            if self.trims[2] else 0)
        self.trims[3] = max([self.trims[3], self.data.params.trim_loci[3]])

self.trims is initialized as [0, 0, 0, 0], and I don't see anywhere in the code before the above function call where self.trims[1] or [2] would ever be set to anything besides 0.

On a side note, it's a little confusing that trim_reads and trim_loci are parameterized differently for the 3' ends of reads. trim_reads takes a length in bp to trim 3' to, and passing negative values trims that many bases from the end. trim_loci looks like postive values trim from the end and negative values don't appear to be accepted. It's a little confusing, they should probably both work the same way.

gscabanne commented 1 year ago

Hi, I am using ipyrad ver 0.9.84 and trim_loci does not work. I tried to trim six bp from 3' ends of loci (format 0, 6, 0, 0 ) and cut nothing. I also tried (0, -6, 0, 0 ), but the result was the same. I only tried this at step 7, which is supposed to act. Any way to solve this?

isaacovercast commented 1 year ago

Ooooh, I see what's happened. Finding the internal edges for paired data is still not implemented, so it's defaulting to not trimming those. The clue is the cryptic message and the ellipses and the fact that edges[1] and edges[2] are set to 0 here:

    def trim_for_coverage(self, minsite_left=4, minsite_right=4):
        "trim edges to where data is not N or -"

        # what is the limit of site coverage for trimming?
        minsamp_left = min(minsite_left, self.seqs.shape[0])
        minsamp_right = min(minsite_right, self.seqs.shape[0])

        # how much cov is there at each site?
        mincovs = np.sum((self.seqs != 78) & (self.seqs != 45), axis=0)

        # locus left trim
        self.edges[0] = locus_left_trim(self.seqs, minsamp_left, mincovs)
        self.edges[3] = locus_right_trim(self.seqs, minsamp_right, mincovs)
        if self.edges[3] <= self.edges[0]:
            self.bad = True

        # find insert region for paired data to mask it...
        self.edges[1] = 0
        self.edges[2] = 0

This is going to be a moderate amount of work, so it might not happen quickly, sorry to say.

isaacovercast commented 1 year ago

The trimming of the external edges happens on L750-753:

            # trim edges, need to use uppered seqs for maxvar & maxshared
            edg = self.edges[self.iloc]
            ublock = self.useqs[:, edg[0]:edg[3]]
            ablock = self.aseqs[:, edg[0]:edg[3]]

The trimming of internal edges isn't really trimming, but rather masking edges flanking the 'nnnn' insert between mate pairs (on L825):

                # mask insert in denovo data
                self.aseqs[:, edges.edges[1]:edges.edges[2]] = 110  # n
                self.useqs[:, edges.edges[1]:edges.edges[2]] = 78   # N

This isn't doing anything right now because edges[1] and edges[2] are always 0. The interpretation here is that the values in edges should be the location in bp to trim to, and that is indeed what trim_check() would do.